You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by mengxr <gi...@git.apache.org> on 2015/04/09 02:42:40 UTC

[GitHub] spark pull request: [WIP][SPARK-5957][ML] better handling of param...

GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/5431

    [WIP][SPARK-5957][ML] better handling of parameters

    The design doc was posted on the JIRA page.
    
    1. Use codegen for shared params.
    1. Move shared params to package `ml.param.shared`. 
    1. Set default values in `Params` instead of in `Param`.
    1. Add a few methods to `Params` and `ParamMap`.
    1. Move schema handling to `SchemaUtils` from `Params`.
    
    - [ ] check visibility of the methods added

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-5957

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5431.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5431
    
----
commit dcab97aa1dbf4573c3fcb379c3152bca1b0837e6
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-08T20:04:51Z

    add codegen for shared params

commit abb7a3bcabe9091d5b5ad493660daaa4c671f613
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-08T20:56:08Z

    update default values handling

commit 1c72579a731d61e2bbbccd00b4bc9358618a58da
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-08T21:01:31Z

    pass test compile

commit d9302b82eb2c325b15f20dd1b30bdf4c7cdc58a6
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-08T21:24:31Z

    generate default values

commit a9dbf59a960e1a35c0cbdadebefa65b721e73bf5
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-08T21:31:12Z

    minor updates

commit 0d3fc5ba1fb0cb15bcb2bda20ebf1d98cfd48243
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-08T21:49:04Z

    setDefault after param

commit eeeffe89409e7b97ad0e091e18f1084621543422
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-08T22:33:19Z

    map ++ paramMap => extractValues

commit 0d9594e41d69fc74facb95d672612b22c4f5c44c
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-08T22:43:59Z

    add getOrElse to ParamMap

commit 4ac6348c3211c1221d21030e13ac0f925dd81abc
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-08T23:54:42Z

    move schema utils to SchemaUtils
    add a few methods to Params

commit 48d0e843d8e1bb4d6db275316435829c81d628e3
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-09T00:13:56Z

    add remove and update explainParams

commit 94fd98e94ebf3c4460e941ad36aab711791d107d
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-09T00:21:19Z

    fix explain params

commit 29b004ca130305969f7c48285c9c17fae8f662e1
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-04-09T00:32:12Z

    update ParamsSuite

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28282519
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
    @@ -17,15 +17,16 @@
     
     package org.apache.spark.ml.classification
     
    -import org.apache.spark.annotation.{DeveloperApi, AlphaComponent}
    +import org.apache.spark.annotation.{AlphaComponent, DeveloperApi}
     import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, PredictorParams}
    -import org.apache.spark.ml.param.{Params, ParamMap, HasRawPredictionCol}
    +import org.apache.spark.ml.param.{ParamMap, Params}
    +import org.apache.spark.ml.param.shared.HasRawPredictionCol
    +import org.apache.spark.ml.util.SchemaUtils
     import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
    -import org.apache.spark.sql.functions._
     import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions._
     import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
     
    -
    --- End diff --
    
    People's opinions about this differ  : )   (See the DataFrames PR)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92544002
  
      [Test build #30210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30210/consoleFull) for   PR 5431 at commit [`26ae2d7`](https://github.com/apache/spark/commit/26ae2d7898fa598e6fef8a91721992f1d7c793e5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28290778
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -325,7 +379,7 @@ class ParamMap private[ml] (private val map: mutable.Map[Param[Any], Any]) exten
       }
     
       /**
    -   * Make a copy of this param map.
    +   * Creates a copy of this param map.
        */
       def copy: ParamMap = new ParamMap(map.clone())
    --- End diff --
    
    `()` is used if it changes the content of the instance or triggers IO. Another exception is there are overridden methods and we have to use `()` in Scala. For this case, I think `copy` without `()` is correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28282508
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
    @@ -78,23 +81,42 @@ class ParamsSuite extends FunSuite {
       }
     
       test("params") {
    --- End diff --
    
    Add test for "params.clear"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92255674
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30150/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [WIP][SPARK-5957][ML] better handling of param...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-91080029
  
      [Test build #29906 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29906/consoleFull) for   PR 5431 at commit [`29b004c`](https://github.com/apache/spark/commit/29b004ca130305969f7c48285c9c17fae8f662e1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28283689
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -325,7 +379,7 @@ class ParamMap private[ml] (private val map: mutable.Map[Param[Any], Any]) exten
       }
     
       /**
    -   * Make a copy of this param map.
    +   * Creates a copy of this param map.
        */
       def copy: ParamMap = new ParamMap(map.clone())
    --- End diff --
    
    Do we like copy methods to use parentheses or not?  I think we're not completely consistent.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [WIP][SPARK-5957][ML] better handling of param...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-91080774
  
      [Test build #29906 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29906/consoleFull) for   PR 5431 at commit [`29b004c`](https://github.com/apache/spark/commit/29b004ca130305969f7c48285c9c17fae8f662e1).
     * This patch **fails to build**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [WIP][SPARK-5957][ML] better handling of param...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-91082700
  
      [Test build #29909 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29909/consoleFull) for   PR 5431 at commit [`d63b5cc`](https://github.com/apache/spark/commit/d63b5cca9aa334ab36cbd9403fd6c5e8dddd4aa6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92552866
  
      [Test build #30210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30210/consoleFull) for   PR 5431 at commit [`26ae2d7`](https://github.com/apache/spark/commit/26ae2d7898fa598e6fef8a91721992f1d7c793e5).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28290532
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {
       /**
        * Sets a parameter (by name) in the embedded param map.
        */
    -  private[ml] def set(param: String, value: Any): this.type = {
    +  protected final def set(param: String, value: Any): this.type = {
         set(getParam(param), value)
       }
     
       /**
    -   * Gets the value of a parameter in the embedded param map.
    +   * Optionally returns the user-supplied value of a param.
    +   */
    +  final def get[T](param: Param[T]): Option[T] = {
    +    shouldOwn(param)
    +    paramMap.get(param)
    +  }
    +
    +  /**
    +   * Clears the user-supplied value for the input param.
    +   */
    +  final def clear(param: Param[_]): this.type = {
    +    shouldOwn(param)
    +    paramMap.remove(param)
    +    this
    +  }
    +
    +  /**
    +   * Gets the value of a param in the embedded param map or its default value. Throws an exception
    +   * if neither is set.
        */
    -  protected def get[T](param: Param[T]): T = {
    -    require(param.parent.eq(this))
    -    paramMap(param)
    +  final def getOrDefault[T](param: Param[T]): T = {
    +    shouldOwn(param)
    +    get(param).orElse(getDefault(param)).get
       }
     
       /**
    -   * Internal param map.
    +   * Sets a default value. Make sure that the input param is initialized before this gets called.
        */
    -  protected val paramMap: ParamMap = ParamMap.empty
    +  protected final def setDefault[T](param: Param[T], value: T): this.type = {
    +    shouldOwn(param)
    +    defaultParamMap.put(param, value)
    +    this
    +  }
     
       /**
    -   * Check whether the given schema contains an input column.
    -   * @param colName  Input column name
    -   * @param dataType  Input column DataType
    +   * Sets default values. Make sure that the input params are initialized before this gets called.
        */
    -  protected def checkInputColumn(schema: StructType, colName: String, dataType: DataType): Unit = {
    -    val actualDataType = schema(colName).dataType
    -    require(actualDataType.equals(dataType), s"Input column $colName must be of type $dataType" +
    -      s" but was actually $actualDataType.  Column param description: ${getParam(colName)}")
    +  protected final def setDefault(paramPairs: ParamPair[_]*): this.type = {
    +    paramPairs.foreach { p =>
    +      setDefault(p.param.asInstanceOf[Param[Any]], p.value)
    +    }
    +    this
       }
     
       /**
    -   * Add an output column to the given schema.
    -   * This fails if the given output column already exists.
    -   * @param schema  Initial schema (not modified)
    -   * @param colName  Output column name.  If this column name is an empy String "", this method
    -   *                 returns the initial schema, unchanged.  This allows users to disable output
    -   *                 columns.
    -   * @param dataType  Output column DataType
    -   */
    -  protected def addOutputColumn(
    -      schema: StructType,
    -      colName: String,
    -      dataType: DataType): StructType = {
    -    if (colName.length == 0) return schema
    -    val fieldNames = schema.fieldNames
    -    require(!fieldNames.contains(colName), s"Output column $colName already exists.")
    -    val outputFields = schema.fields ++ Seq(StructField(colName, dataType, nullable = false))
    -    StructType(outputFields)
    +   * Gets the default value of a parameter.
    +   */
    +  final def getDefault[T](param: Param[T]): Option[T] = {
    +    shouldOwn(param)
    +    defaultParamMap.get(param)
    +  }
    +
    +  /**
    +   * Tests whether the input param has a default value set.
    +   */
    +  final def hasDefault[T](param: Param[T]): Boolean = {
    +    shouldOwn(param)
    +    defaultParamMap.contains(param)
    +  }
    +
    +  /**
    +   * Extracts the embedded default param values and user-supplied values, and then merges them with
    +   * extra values from input into a flat param map, where the latter value is used if there exist
    +   * conflicts.
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28282550
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {
       /**
        * Sets a parameter (by name) in the embedded param map.
        */
    -  private[ml] def set(param: String, value: Any): this.type = {
    +  protected final def set(param: String, value: Any): this.type = {
         set(getParam(param), value)
       }
     
       /**
    -   * Gets the value of a parameter in the embedded param map.
    +   * Optionally returns the user-supplied value of a param.
    +   */
    +  final def get[T](param: Param[T]): Option[T] = {
    +    shouldOwn(param)
    +    paramMap.get(param)
    +  }
    +
    +  /**
    +   * Clears the user-supplied value for the input param.
    +   */
    +  final def clear(param: Param[_]): this.type = {
    +    shouldOwn(param)
    +    paramMap.remove(param)
    +    this
    +  }
    +
    +  /**
    +   * Gets the value of a param in the embedded param map or its default value. Throws an exception
    +   * if neither is set.
        */
    -  protected def get[T](param: Param[T]): T = {
    -    require(param.parent.eq(this))
    -    paramMap(param)
    +  final def getOrDefault[T](param: Param[T]): T = {
    +    shouldOwn(param)
    +    get(param).orElse(getDefault(param)).get
       }
     
       /**
    -   * Internal param map.
    +   * Sets a default value. Make sure that the input param is initialized before this gets called.
    --- End diff --
    
    "Make sure that the input param is initialized before this gets called." will be unclear to new developers.  Maybe move this to parameter-specific doc:
    ```
    @param param:  Make sure that this param is initialized before this method gets called.
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5431


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92594483
  
    Merged into master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28290818
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala ---
    @@ -0,0 +1,169 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.param.shared
    +
    +import java.io.PrintWriter
    +
    +import scala.reflect.ClassTag
    +
    +/**
    + * Code generator for shared params (sharedParams.scala). Run under the Spark folder with
    + * {{{
    + *   build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen"
    + * }}}
    + */
    +private[shared] object SharedParamsCodeGen {
    +
    +  def main(args: Array[String]): Unit = {
    +    val params = Seq(
    +      ParamDesc[Double]("regParam", "regularization parameter"),
    +      ParamDesc[Int]("maxIter", "max number of iterations"),
    +      ParamDesc[String]("featuresCol", "features column name", Some("\"features\"")),
    +      ParamDesc[String]("labelCol", "label column name", Some("\"label\"")),
    +      ParamDesc[String]("predictionCol", "prediction column name", Some("\"prediction\"")),
    +      ParamDesc[String]("rawPredictionCol", "raw prediction (a.k.a. confidence) column name",
    +        Some("\"rawPrediction\"")),
    +      ParamDesc[String]("probabilityCol",
    +        "column name for predicted class conditional probabilities", Some("\"probability\"")),
    +      ParamDesc[Double]("threshold", "threshold in prediction"),
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28290414
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {
       /**
        * Sets a parameter (by name) in the embedded param map.
        */
    -  private[ml] def set(param: String, value: Any): this.type = {
    +  protected final def set(param: String, value: Any): this.type = {
         set(getParam(param), value)
       }
     
       /**
    -   * Gets the value of a parameter in the embedded param map.
    +   * Optionally returns the user-supplied value of a param.
    +   */
    +  final def get[T](param: Param[T]): Option[T] = {
    +    shouldOwn(param)
    +    paramMap.get(param)
    +  }
    +
    +  /**
    +   * Clears the user-supplied value for the input param.
    +   */
    +  final def clear(param: Param[_]): this.type = {
    +    shouldOwn(param)
    +    paramMap.remove(param)
    +    this
    +  }
    +
    +  /**
    +   * Gets the value of a param in the embedded param map or its default value. Throws an exception
    +   * if neither is set.
        */
    -  protected def get[T](param: Param[T]): T = {
    -    require(param.parent.eq(this))
    -    paramMap(param)
    +  final def getOrDefault[T](param: Param[T]): T = {
    +    shouldOwn(param)
    +    get(param).orElse(getDefault(param)).get
       }
     
       /**
    -   * Internal param map.
    +   * Sets a default value. Make sure that the input param is initialized before this gets called.
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28290995
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -55,58 +49,42 @@ class Param[T] (
        */
       def ->(value: T): ParamPair[T] = ParamPair(this, value)
     
    -  override def toString: String = {
    -    if (defaultValue.isDefined) {
    -      s"$name: $doc (default: ${defaultValue.get})"
    -    } else {
    -      s"$name: $doc"
    -    }
    -  }
    +  override def toString: String = s"$name: $doc"
    --- End diff --
    
    I like having the default.  Your call on the current value; that might actually be confusing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28282538
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -55,58 +49,42 @@ class Param[T] (
        */
       def ->(value: T): ParamPair[T] = ParamPair(this, value)
     
    -  override def toString: String = {
    -    if (defaultValue.isDefined) {
    -      s"$name: $doc (default: ${defaultValue.get})"
    -    } else {
    -      s"$name: $doc"
    -    }
    -  }
    +  override def toString: String = s"$name: $doc"
    --- End diff --
    
    Why not print default using parent?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28290419
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {
       /**
        * Sets a parameter (by name) in the embedded param map.
        */
    -  private[ml] def set(param: String, value: Any): this.type = {
    +  protected final def set(param: String, value: Any): this.type = {
         set(getParam(param), value)
       }
     
       /**
    -   * Gets the value of a parameter in the embedded param map.
    +   * Optionally returns the user-supplied value of a param.
    +   */
    +  final def get[T](param: Param[T]): Option[T] = {
    +    shouldOwn(param)
    +    paramMap.get(param)
    +  }
    +
    +  /**
    +   * Clears the user-supplied value for the input param.
    +   */
    +  final def clear(param: Param[_]): this.type = {
    +    shouldOwn(param)
    +    paramMap.remove(param)
    +    this
    +  }
    +
    +  /**
    +   * Gets the value of a param in the embedded param map or its default value. Throws an exception
    +   * if neither is set.
        */
    -  protected def get[T](param: Param[T]): T = {
    -    require(param.parent.eq(this))
    -    paramMap(param)
    +  final def getOrDefault[T](param: Param[T]): T = {
    +    shouldOwn(param)
    +    get(param).orElse(getDefault(param)).get
       }
     
       /**
    -   * Internal param map.
    +   * Sets a default value. Make sure that the input param is initialized before this gets called.
        */
    -  protected val paramMap: ParamMap = ParamMap.empty
    +  protected final def setDefault[T](param: Param[T], value: T): this.type = {
    +    shouldOwn(param)
    +    defaultParamMap.put(param, value)
    +    this
    +  }
     
       /**
    -   * Check whether the given schema contains an input column.
    -   * @param colName  Input column name
    -   * @param dataType  Input column DataType
    +   * Sets default values. Make sure that the input params are initialized before this gets called.
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92565573
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30212/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28291065
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {
       /**
        * Sets a parameter (by name) in the embedded param map.
        */
    -  private[ml] def set(param: String, value: Any): this.type = {
    +  protected final def set(param: String, value: Any): this.type = {
         set(getParam(param), value)
       }
     
       /**
    -   * Gets the value of a parameter in the embedded param map.
    +   * Optionally returns the user-supplied value of a param.
    +   */
    +  final def get[T](param: Param[T]): Option[T] = {
    --- End diff --
    
    OK, that sounds fine, including making clear() non-public


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [WIP][SPARK-5957][ML] better handling of param...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-91080776
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29906/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28289697
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
    @@ -17,15 +17,16 @@
     
     package org.apache.spark.ml.classification
     
    -import org.apache.spark.annotation.{DeveloperApi, AlphaComponent}
    +import org.apache.spark.annotation.{AlphaComponent, DeveloperApi}
     import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, PredictorParams}
    -import org.apache.spark.ml.param.{Params, ParamMap, HasRawPredictionCol}
    +import org.apache.spark.ml.param.{ParamMap, Params}
    +import org.apache.spark.ml.param.shared.HasRawPredictionCol
    +import org.apache.spark.ml.util.SchemaUtils
     import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
    -import org.apache.spark.sql.functions._
     import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions._
     import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
     
    -
    --- End diff --
    
    No strong preference on this one:)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92565550
  
      [Test build #30212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30212/consoleFull) for   PR 5431 at commit [`d19236d`](https://github.com/apache/spark/commit/d19236dd12d57ced30374e2b1ac22b720b5c42d8).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28289673
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
    @@ -78,23 +81,42 @@ class ParamsSuite extends FunSuite {
       }
     
       test("params") {
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28290695
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {
       /**
        * Sets a parameter (by name) in the embedded param map.
        */
    -  private[ml] def set(param: String, value: Any): this.type = {
    +  protected final def set(param: String, value: Any): this.type = {
         set(getParam(param), value)
       }
     
       /**
    -   * Gets the value of a parameter in the embedded param map.
    +   * Optionally returns the user-supplied value of a param.
    +   */
    +  final def get[T](param: Param[T]): Option[T] = {
    --- End diff --
    
    I made this change in my last commit. `set` cannot be public because `Model` also extends `Params` and some params should be immutable in models. Maybe `clear` shouldn't be public either. But for get, getDefault, and getOrDefault, I think those could be public.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28282819
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {
       /**
        * Sets a parameter (by name) in the embedded param map.
        */
    -  private[ml] def set(param: String, value: Any): this.type = {
    +  protected final def set(param: String, value: Any): this.type = {
         set(getParam(param), value)
       }
     
       /**
    -   * Gets the value of a parameter in the embedded param map.
    +   * Optionally returns the user-supplied value of a param.
    +   */
    +  final def get[T](param: Param[T]): Option[T] = {
    +    shouldOwn(param)
    +    paramMap.get(param)
    +  }
    +
    +  /**
    +   * Clears the user-supplied value for the input param.
    +   */
    +  final def clear(param: Param[_]): this.type = {
    +    shouldOwn(param)
    +    paramMap.remove(param)
    +    this
    +  }
    +
    +  /**
    +   * Gets the value of a param in the embedded param map or its default value. Throws an exception
    +   * if neither is set.
        */
    -  protected def get[T](param: Param[T]): T = {
    -    require(param.parent.eq(this))
    -    paramMap(param)
    +  final def getOrDefault[T](param: Param[T]): T = {
    +    shouldOwn(param)
    +    get(param).orElse(getDefault(param)).get
       }
     
       /**
    -   * Internal param map.
    +   * Sets a default value. Make sure that the input param is initialized before this gets called.
        */
    -  protected val paramMap: ParamMap = ParamMap.empty
    +  protected final def setDefault[T](param: Param[T], value: T): this.type = {
    +    shouldOwn(param)
    +    defaultParamMap.put(param, value)
    +    this
    +  }
     
       /**
    -   * Check whether the given schema contains an input column.
    -   * @param colName  Input column name
    -   * @param dataType  Input column DataType
    +   * Sets default values. Make sure that the input params are initialized before this gets called.
        */
    -  protected def checkInputColumn(schema: StructType, colName: String, dataType: DataType): Unit = {
    -    val actualDataType = schema(colName).dataType
    -    require(actualDataType.equals(dataType), s"Input column $colName must be of type $dataType" +
    -      s" but was actually $actualDataType.  Column param description: ${getParam(colName)}")
    +  protected final def setDefault(paramPairs: ParamPair[_]*): this.type = {
    +    paramPairs.foreach { p =>
    +      setDefault(p.param.asInstanceOf[Param[Any]], p.value)
    +    }
    +    this
       }
     
       /**
    -   * Add an output column to the given schema.
    -   * This fails if the given output column already exists.
    -   * @param schema  Initial schema (not modified)
    -   * @param colName  Output column name.  If this column name is an empy String "", this method
    -   *                 returns the initial schema, unchanged.  This allows users to disable output
    -   *                 columns.
    -   * @param dataType  Output column DataType
    -   */
    -  protected def addOutputColumn(
    -      schema: StructType,
    -      colName: String,
    -      dataType: DataType): StructType = {
    -    if (colName.length == 0) return schema
    -    val fieldNames = schema.fieldNames
    -    require(!fieldNames.contains(colName), s"Output column $colName already exists.")
    -    val outputFields = schema.fields ++ Seq(StructField(colName, dataType, nullable = false))
    -    StructType(outputFields)
    +   * Gets the default value of a parameter.
    +   */
    +  final def getDefault[T](param: Param[T]): Option[T] = {
    +    shouldOwn(param)
    +    defaultParamMap.get(param)
    +  }
    +
    +  /**
    +   * Tests whether the input param has a default value set.
    +   */
    +  final def hasDefault[T](param: Param[T]): Boolean = {
    +    shouldOwn(param)
    +    defaultParamMap.contains(param)
    +  }
    +
    +  /**
    +   * Extracts the embedded default param values and user-supplied values, and then merges them with
    +   * extra values from input into a flat param map, where the latter value is used if there exist
    +   * conflicts.
    --- End diff --
    
    This is a bit confusing.  I like specifying the priority ordering:
    ```
    Priority ordering: default params < user-supplied params < extraParamMap
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28283381
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {
       /**
        * Sets a parameter (by name) in the embedded param map.
        */
    -  private[ml] def set(param: String, value: Any): this.type = {
    +  protected final def set(param: String, value: Any): this.type = {
         set(getParam(param), value)
       }
     
       /**
    -   * Gets the value of a parameter in the embedded param map.
    +   * Optionally returns the user-supplied value of a param.
    +   */
    +  final def get[T](param: Param[T]): Option[T] = {
    --- End diff --
    
    Why is get public and set private?  Shouldn't both be public (which might be handy for programmatic setting of parameters, such as from config files) or both be private (since implementing classes will provide getters/setters anyways)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28289791
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -55,58 +49,42 @@ class Param[T] (
        */
       def ->(value: T): ParamPair[T] = ParamPair(this, value)
     
    -  override def toString: String = {
    -    if (defaultValue.isDefined) {
    -      s"$name: $doc (default: ${defaultValue.get})"
    -    } else {
    -      s"$name: $doc"
    -    }
    -  }
    +  override def toString: String = s"$name: $doc"
    --- End diff --
    
    I thought about this option but I'm not sure whether we should put default value and current value in `Param.toString` because they are stored in the parent. But no strong preference.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92255656
  
      [Test build #30150 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30150/consoleFull) for   PR 5431 at commit [`eec2264`](https://github.com/apache/spark/commit/eec2264ef8e49a0be5f2b7539c8bba6e4279c01a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92513221
  
    @mengxr I added minor comment but don't have major ones.  That's it for now!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [WIP][SPARK-5957][ML] better handling of param...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-91084056
  
      [Test build #29910 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29910/consoleFull) for   PR 5431 at commit [`55be1f3`](https://github.com/apache/spark/commit/55be1f33a0176a03baf588bbd6a4addd0d645c07).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28282554
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
    @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {
       /**
        * Sets a parameter (by name) in the embedded param map.
        */
    -  private[ml] def set(param: String, value: Any): this.type = {
    +  protected final def set(param: String, value: Any): this.type = {
         set(getParam(param), value)
       }
     
       /**
    -   * Gets the value of a parameter in the embedded param map.
    +   * Optionally returns the user-supplied value of a param.
    +   */
    +  final def get[T](param: Param[T]): Option[T] = {
    +    shouldOwn(param)
    +    paramMap.get(param)
    +  }
    +
    +  /**
    +   * Clears the user-supplied value for the input param.
    +   */
    +  final def clear(param: Param[_]): this.type = {
    +    shouldOwn(param)
    +    paramMap.remove(param)
    +    this
    +  }
    +
    +  /**
    +   * Gets the value of a param in the embedded param map or its default value. Throws an exception
    +   * if neither is set.
        */
    -  protected def get[T](param: Param[T]): T = {
    -    require(param.parent.eq(this))
    -    paramMap(param)
    +  final def getOrDefault[T](param: Param[T]): T = {
    +    shouldOwn(param)
    +    get(param).orElse(getDefault(param)).get
       }
     
       /**
    -   * Internal param map.
    +   * Sets a default value. Make sure that the input param is initialized before this gets called.
        */
    -  protected val paramMap: ParamMap = ParamMap.empty
    +  protected final def setDefault[T](param: Param[T], value: T): this.type = {
    +    shouldOwn(param)
    +    defaultParamMap.put(param, value)
    +    this
    +  }
     
       /**
    -   * Check whether the given schema contains an input column.
    -   * @param colName  Input column name
    -   * @param dataType  Input column DataType
    +   * Sets default values. Make sure that the input params are initialized before this gets called.
    --- End diff --
    
    ditto: move to parameter-specific doc


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28283755
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala ---
    @@ -0,0 +1,169 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.param.shared
    +
    +import java.io.PrintWriter
    +
    +import scala.reflect.ClassTag
    +
    +/**
    + * Code generator for shared params (sharedParams.scala). Run under the Spark folder with
    + * {{{
    + *   build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen"
    + * }}}
    + */
    +private[shared] object SharedParamsCodeGen {
    +
    +  def main(args: Array[String]): Unit = {
    +    val params = Seq(
    +      ParamDesc[Double]("regParam", "regularization parameter"),
    +      ParamDesc[Int]("maxIter", "max number of iterations"),
    +      ParamDesc[String]("featuresCol", "features column name", Some("\"features\"")),
    +      ParamDesc[String]("labelCol", "label column name", Some("\"label\"")),
    +      ParamDesc[String]("predictionCol", "prediction column name", Some("\"prediction\"")),
    +      ParamDesc[String]("rawPredictionCol", "raw prediction (a.k.a. confidence) column name",
    +        Some("\"rawPrediction\"")),
    +      ParamDesc[String]("probabilityCol",
    +        "column name for predicted class conditional probabilities", Some("\"probability\"")),
    +      ParamDesc[Double]("threshold", "threshold in prediction"),
    --- End diff --
    
    "threshold in prediction" --> "threshold in binary classification"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92548691
  
    LGTM once tests pass


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28282559
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala ---
    @@ -0,0 +1,61 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.util
    +
    +import org.apache.spark.annotation.DeveloperApi
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +
    +/**
    + * :: DeveloperApi ::
    + * Utils for handling schemas.
    + */
    +@DeveloperApi
    +object SchemaUtils {
    +
    +  // TODO: Move the utility methods to SQL.
    +
    +  /**
    +   * Check whether the given schema contains an column of the required data type.
    --- End diff --
    
    typo: "an" --> "a"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92549813
  
      [Test build #30212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30212/consoleFull) for   PR 5431 at commit [`d19236d`](https://github.com/apache/spark/commit/d19236dd12d57ced30374e2b1ac22b720b5c42d8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5431#discussion_r28282562
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala ---
    @@ -0,0 +1,61 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.util
    +
    +import org.apache.spark.annotation.DeveloperApi
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +
    +/**
    + * :: DeveloperApi ::
    + * Utils for handling schemas.
    + */
    +@DeveloperApi
    +object SchemaUtils {
    +
    +  // TODO: Move the utility methods to SQL.
    +
    +  /**
    +   * Check whether the given schema contains an column of the required data type.
    +   * @param colName  column name
    +   * @param dataType  required column data type
    +   */
    +  def checkColumnType(schema: StructType, colName: String, dataType: DataType): Unit = {
    +    val actualDataType = schema(colName).dataType
    +    require(actualDataType.equals(dataType),
    +      s"Column $colName must be of type $dataType but was actually $actualDataType.")
    +  }
    +
    +  /**
    +   * Appends a new column to the input schema. This fails if the given output column already exists.
    +   * @param schema input schema
    +   * @param colName new column name. If this column name is en empty string "", this method returns
    --- End diff --
    
    typo: "en" -> "an"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92230105
  
      [Test build #30150 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30150/consoleFull) for   PR 5431 at commit [`eec2264`](https://github.com/apache/spark/commit/eec2264ef8e49a0be5f2b7539c8bba6e4279c01a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92509721
  
    Should this method in Params be made abstract?
    ```
    def validate(paramMap: ParamMap): Unit = {}
    ```
    I just realized we haven't been using it, and making it abstract would force users to think about it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5431#issuecomment-92552876
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30210/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org