You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by WeichenXu123 <gi...@git.apache.org> on 2017/10/27 15:39:53 UTC

[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/19588

    [SPARK-12375][ML] VectorIndexerModel support handle unseen categories via handleInvalid

    ## What changes were proposed in this pull request?
    
    Support skip/error/keep strategy, similar to `StringIndexer`.
    Implemented through `try...catch`, so that it can avoid possible performance impact.
    
    ## How was this patch tested?
    
    Unit test added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark handle_invalid_for_vector_indexer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19588.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19588
    
----
commit d06399cb54ed4d2420e0982238bc1dd5f5a425bd
Author: WeichenXu <we...@databricks.com>
Date:   2017-10-27T15:32:47Z

    init pr

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Also we need jira for python.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r152442655
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
    --- End diff --
    
    Can I ask about the lazy val here? What happens here if a user sets the params, calls transform, modifies one of the params, and then calls transform again? Is that a concern?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r147542224
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
    --- End diff --
    
    We need `lazy` modifier here. Otherwise we cannot get correct `$(handleInvalid)` value in this `transformFunc`, because of var initialization order.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r150771136
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
    --- End diff --
    
    I don't think changing it to a `def` is a good idea.
    Because, it will cause the whole `Model` object to be serialized.
    Current code, it will create a closure, which only involve the outer variables which are needed. It can help to avoid unexpected serialization.
    ```
    lazy val transformFunc = {
        // the returned closure `f` will only reference the following 4 outer variables.
        val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
        val localVectorMap = categoryMaps
        val localNumFeatures = numFeatures
        val localHandleInvalid = getHandleInvalid
        val f : Vector => Vector = { (v: Vector) => ...}
        f   // return closure `f`
    }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r150748259
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +346,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
         val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
         val localVectorMap = categoryMaps
         val localNumFeatures = numFeatures
    +    val localHandleInvalid = getHandleInvalid
         val f: Vector => Vector = { (v: Vector) =>
           assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" +
             s" $numFeatures but found length ${v.size}")
           v match {
             case dv: DenseVector =>
    +          var hasInvalid = false
               val tmpv = dv.copy
               localVectorMap.foreach { case (featureIndex: Int, categoryMap: Map[Double, Int]) =>
    -            tmpv.values(featureIndex) = categoryMap(tmpv(featureIndex))
    +            try {
    --- End diff --
    
    Ah, that's better. Guess Xiangrui's comments become our preoccupation. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83397/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    **[Test build #83840 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83840/testReport)** for PR 19588 at commit [`c68098a`](https://github.com/apache/spark/commit/c68098a8094a97284462ce33d589e5910d02eda3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    cc @hhbyyh @MrBago Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    **[Test build #83126 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83126/testReport)** for PR 19588 at commit [`7cb53ba`](https://github.com/apache/spark/commit/7cb53bac12b5eb60b52578e78f6543b49b0c28db).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r150718504
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala ---
    @@ -219,6 +231,33 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext
         checkCategoryMaps(densePoints2, maxCategories = 2, categoricalFeatures = Set(1, 3))
       }
     
    +  test("handle invalid") {
    +    for ((points, pointsTestInvalid) <- Seq((densePoints1, densePoints1TestInvalid),
    +      (sparsePoints1, sparsePoints1TestInvalid))) {
    +      intercept[SparkException] {
    --- End diff --
    
    Only wrap the part which should throw the exception in the intercept.  Otherwise, it might be masking some other error.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r152701758
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
    --- End diff --
    
    @WeichenXu123 Thanks for the clarification, I'm still working on understanding Params and when they can & can't be modified :).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r148444535
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
         val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
         val localVectorMap = categoryMaps
         val localNumFeatures = numFeatures
    +    val localHandleInvalid = getHandleInvalid
         val f: Vector => Vector = { (v: Vector) =>
           assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" +
             s" $numFeatures but found length ${v.size}")
    +      val exceptMsg = "VectorIndexer encountered NULL value. To handle" +
    --- End diff --
    
    And I would suggest moving the exceptMsg in case VectorIndexer.ERROR_INVALID, where it may provide some concrete    error info, like the featureIndex and unexpected value.
    Otherwise it will be very hard for the users to locate the root cause for the error case. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    **[Test build #83840 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83840/testReport)** for PR 19588 at commit [`c68098a`](https://github.com/apache/spark/commit/c68098a8094a97284462ce33d589e5910d02eda3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r148444910
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
         val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
         val localVectorMap = categoryMaps
         val localNumFeatures = numFeatures
    +    val localHandleInvalid = getHandleInvalid
         val f: Vector => Vector = { (v: Vector) =>
           assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" +
             s" $numFeatures but found length ${v.size}")
    +      val exceptMsg = "VectorIndexer encountered NULL value. To handle" +
    +        " or skip NULLS, try setting VectorIndexer.handleInvalid."
           v match {
             case dv: DenseVector =>
    +          var hasExist = false
    --- End diff --
    
    not sure about the variableName... is it for hasInvalid?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r150445283
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -37,7 +38,25 @@ import org.apache.spark.sql.types.{StructField, StructType}
     import org.apache.spark.util.collection.OpenHashSet
     
     /** Private trait for params for VectorIndexer and VectorIndexerModel */
    -private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol {
    +private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol
    +  with HasHandleInvalid {
    +
    +  /**
    +   * Param for how to handle invalid data (unseen labels or NULL values).
    +   * Options are 'skip' (filter out rows with invalid data),
    +   * 'error' (throw an error), or 'keep' (put invalid data in a special additional
    +   * bucket, at index numLabels).
    +   * Default: "error"
    --- End diff --
    
    What about leave case-sensitive issue into separate PR ? There're a lot of places, need similar change. We can do it in separate PR for more discussion.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83391/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    **[Test build #83391 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83391/testReport)** for PR 19588 at commit [`5117bae`](https://github.com/apache/spark/commit/5117baed21b19ccea09821995d22692a9855aa97).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r150717860
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +346,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
         val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
         val localVectorMap = categoryMaps
         val localNumFeatures = numFeatures
    +    val localHandleInvalid = getHandleInvalid
         val f: Vector => Vector = { (v: Vector) =>
           assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" +
             s" $numFeatures but found length ${v.size}")
           v match {
             case dv: DenseVector =>
    +          var hasInvalid = false
               val tmpv = dv.copy
               localVectorMap.foreach { case (featureIndex: Int, categoryMap: Map[Double, Int]) =>
    -            tmpv.values(featureIndex) = categoryMap(tmpv(featureIndex))
    +            try {
    --- End diff --
    
    Is it really faster to use try-catch?  I've always understood it to be better to avoid Exceptions for implementing logic; e.g., here, we could check ```categoryMap.contains(tmpv(featureIndex))``` or use ```getOrElse```.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r148969162
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -37,7 +38,25 @@ import org.apache.spark.sql.types.{StructField, StructType}
     import org.apache.spark.util.collection.OpenHashSet
     
     /** Private trait for params for VectorIndexer and VectorIndexerModel */
    -private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol {
    +private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol
    +  with HasHandleInvalid {
    +
    +  /**
    +   * Param for how to handle invalid data (unseen labels or NULL values).
    +   * Options are 'skip' (filter out rows with invalid data),
    +   * 'error' (throw an error), or 'keep' (put invalid data in a special additional
    +   * bucket, at index numLabels).
    +   * Default: "error"
    --- End diff --
    
    minor. 
    add case-sensitive.
    
    maybe break it into multiple lines:
      'skip': filter out rows with invalid data.
      'error': throw an error.
      'keep': put invalid data in a special additional bucket, at index numLabels.
    
    And numLabels is a little confusing, how about numCategories?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    **[Test build #83126 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83126/testReport)** for PR 19588 at commit [`7cb53ba`](https://github.com/apache/spark/commit/7cb53bac12b5eb60b52578e78f6543b49b0c28db).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r150760733
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +346,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
         val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
         val localVectorMap = categoryMaps
         val localNumFeatures = numFeatures
    +    val localHandleInvalid = getHandleInvalid
         val f: Vector => Vector = { (v: Vector) =>
           assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" +
             s" $numFeatures but found length ${v.size}")
           v match {
             case dv: DenseVector =>
    +          var hasInvalid = false
               val tmpv = dv.copy
               localVectorMap.foreach { case (featureIndex: Int, categoryMap: Map[Double, Int]) =>
    -            tmpv.values(featureIndex) = categoryMap(tmpv(featureIndex))
    +            try {
    --- End diff --
    
    But, I don't think so. JVM exception handling will be very efficient, when exception do not occur, the code with `try... catch` will have the same performance with non `try... catch` code.
    here is some explanation: https://www.quora.com/How-expensive-is-the-try-catch-block-in-Java-in-terms-of-performance


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    **[Test build #83391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83391/testReport)** for PR 19588 at commit [`5117bae`](https://github.com/apache/spark/commit/5117baed21b19ccea09821995d22692a9855aa97).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r152443133
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
    --- End diff --
    
    I'm not super familiar with scala closures, but if we do
    ```
    def transformFunc: Vector => Vector = {
        
        val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
        val localVectorMap = categoryMaps
        val localNumFeatures = numFeatures
        val localHandleInvalid = getHandleInvalid
        val f : Vector => Vector = { (v: Vector) => ...}
        f
    }
    ```
    Won't we still get a closure and avoid serializing the entire model?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r150716852
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -37,7 +38,26 @@ import org.apache.spark.sql.types.{StructField, StructType}
     import org.apache.spark.util.collection.OpenHashSet
     
     /** Private trait for params for VectorIndexer and VectorIndexerModel */
    -private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol {
    +private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol
    +  with HasHandleInvalid {
    +
    +  /**
    +   * Param for how to handle invalid data (unseen labels or NULL values).
    --- End diff --
    
    We should note that this only applies to categorical features, not continuous ones.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r150761099
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +346,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
         val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
         val localVectorMap = categoryMaps
         val localNumFeatures = numFeatures
    +    val localHandleInvalid = getHandleInvalid
         val f: Vector => Vector = { (v: Vector) =>
           assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" +
             s" $numFeatures but found length ${v.size}")
           v match {
             case dv: DenseVector =>
    +          var hasInvalid = false
               val tmpv = dv.copy
               localVectorMap.foreach { case (featureIndex: Int, categoryMap: Map[Double, Int]) =>
    -            tmpv.values(featureIndex) = categoryMap(tmpv(featureIndex))
    +            try {
    --- End diff --
    
    In this scenario, we can assume that, invalid values rarely occur.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    **[Test build #83125 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83125/testReport)** for PR 19588 at commit [`d06399c`](https://github.com/apache/spark/commit/d06399cb54ed4d2420e0982238bc1dd5f5a425bd).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83125/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r151055221
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +346,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
         val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
         val localVectorMap = categoryMaps
         val localNumFeatures = numFeatures
    +    val localHandleInvalid = getHandleInvalid
         val f: Vector => Vector = { (v: Vector) =>
           assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" +
             s" $numFeatures but found length ${v.size}")
           v match {
             case dv: DenseVector =>
    +          var hasInvalid = false
               val tmpv = dv.copy
               localVectorMap.foreach { case (featureIndex: Int, categoryMap: Map[Double, Int]) =>
    -            tmpv.values(featureIndex) = categoryMap(tmpv(featureIndex))
    +            try {
    --- End diff --
    
    Yes. If exception occur, it will be slow, but, I think we can assume the exception will rarely occur, isn't it ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r148734195
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
         val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
         val localVectorMap = categoryMaps
         val localNumFeatures = numFeatures
    +    val localHandleInvalid = getHandleInvalid
         val f: Vector => Vector = { (v: Vector) =>
           assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" +
             s" $numFeatures but found length ${v.size}")
    +      val exceptMsg = "VectorIndexer encountered NULL value. To handle" +
    --- End diff --
    
    I think `private lazy val transformFunc: Vector => Vector` will have better performance, because this func only need to generate once.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19588


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r151029101
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +346,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
         val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
         val localVectorMap = categoryMaps
         val localNumFeatures = numFeatures
    +    val localHandleInvalid = getHandleInvalid
         val f: Vector => Vector = { (v: Vector) =>
           assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" +
             s" $numFeatures but found length ${v.size}")
           v match {
             case dv: DenseVector =>
    +          var hasInvalid = false
               val tmpv = dv.copy
               localVectorMap.foreach { case (featureIndex: Int, categoryMap: Map[Double, Int]) =>
    -            tmpv.values(featureIndex) = categoryMap(tmpv(featureIndex))
    +            try {
    --- End diff --
    
    The try part is fast, yet the catch part can be very slow comparably.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r148440930
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -37,7 +38,25 @@ import org.apache.spark.sql.types.{StructField, StructType}
     import org.apache.spark.util.collection.OpenHashSet
     
     /** Private trait for params for VectorIndexer and VectorIndexerModel */
    -private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol {
    +private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol
    +  with HasHandleInvalid {
    +
    +  /**
    +   * Param for how to handle invalid data (unseen labels or NULL values).
    +   * Options are 'skip' (filter out rows with invalid data),
    +   * 'error' (throw an error), or 'keep' (put invalid data in a special additional
    +   * bucket, at index numLabels).
    +   * Default: "error"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
    +    "How to handle invalid data (unseen labels or NULL values). " +
    +    "Options are 'skip' (filter out rows with invalid data), error (throw an error), " +
    +    "or 'keep' (put invalid data in a special additional bucket, at index numLabels).",
    +    ParamValidators.inArray(StringIndexer.supportedHandleInvalids))
    +
    +  setDefault(handleInvalid, StringIndexer.ERROR_INVALID)
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r148440879
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -37,7 +38,25 @@ import org.apache.spark.sql.types.{StructField, StructType}
     import org.apache.spark.util.collection.OpenHashSet
     
     /** Private trait for params for VectorIndexer and VectorIndexerModel */
    -private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol {
    +private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol
    +  with HasHandleInvalid {
    +
    +  /**
    +   * Param for how to handle invalid data (unseen labels or NULL values).
    +   * Options are 'skip' (filter out rows with invalid data),
    +   * 'error' (throw an error), or 'keep' (put invalid data in a special additional
    +   * bucket, at index numLabels).
    +   * Default: "error"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
    +    "How to handle invalid data (unseen labels or NULL values). " +
    +    "Options are 'skip' (filter out rows with invalid data), error (throw an error), " +
    +    "or 'keep' (put invalid data in a special additional bucket, at index numLabels).",
    +    ParamValidators.inArray(StringIndexer.supportedHandleInvalids))
    --- End diff --
    
    VectorIndexer 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r152454669
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
    --- End diff --
    
    @MrBago Use `lazy val` will generate the closure only once. If use `def`, each time call `transformFunc` will generate the closure again.
    What you concern is when user setParams for this Model, then the param won't work. But, I think this is not a issue, because `Model` do not allow user to set params, the params are copied from estimator, isn't it ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r148440785
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -37,7 +38,25 @@ import org.apache.spark.sql.types.{StructField, StructType}
     import org.apache.spark.util.collection.OpenHashSet
     
     /** Private trait for params for VectorIndexer and VectorIndexerModel */
    -private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol {
    +private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol
    +  with HasHandleInvalid {
    +
    +  /**
    +   * Param for how to handle invalid data (unseen labels or NULL values).
    +   * Options are 'skip' (filter out rows with invalid data),
    +   * 'error' (throw an error), or 'keep' (put invalid data in a special additional
    +   * bucket, at index numLabels).
    +   * Default: "error"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
    +    "How to handle invalid data (unseen labels or NULL values). " +
    +    "Options are 'skip' (filter out rows with invalid data), error (throw an error), " +
    --- End diff --
    
    'error'


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r148442709
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -287,9 +315,12 @@ class VectorIndexerModel private[ml] (
         while (featureIndex < numFeatures) {
           if (categoryMaps.contains(featureIndex)) {
             // categorical feature
    -        val featureValues: Array[String] =
    +        var featureValues: Array[String] =
    --- End diff --
    
    minor: this can still be a val with if else, right?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    **[Test build #83766 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83766/testReport)** for PR 19588 at commit [`3c73b5d`](https://github.com/apache/spark/commit/3c73b5d8092f2c3217ed76d8d2b74038282893cd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    **[Test build #83125 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83125/testReport)** for PR 19588 at commit [`d06399c`](https://github.com/apache/spark/commit/d06399cb54ed4d2420e0982238bc1dd5f5a425bd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r148444218
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
         val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
         val localVectorMap = categoryMaps
         val localNumFeatures = numFeatures
    +    val localHandleInvalid = getHandleInvalid
         val f: Vector => Vector = { (v: Vector) =>
           assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" +
             s" $numFeatures but found length ${v.size}")
    +      val exceptMsg = "VectorIndexer encountered NULL value. To handle" +
    --- End diff --
    
    I'm afraid the message here is kind of misleading. This is for the case that VectorIndexerModel met a value that not in the "training data", right?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83840/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Sure, I will add python api after this is merged.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r148442070
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
    --- End diff --
    
    how about just move it into 'transform'


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    **[Test build #83766 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83766/testReport)** for PR 19588 at commit [`3c73b5d`](https://github.com/apache/spark/commit/3c73b5d8092f2c3217ed76d8d2b74038282893cd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    @hhbyyh comments addressed. Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19588#discussion_r150719166
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml] (
       // TODO: Check more carefully about whether this whole class will be included in a closure.
     
       /** Per-vector transform function */
    -  private val transformFunc: Vector => Vector = {
    +  private lazy val transformFunc: Vector => Vector = {
    --- End diff --
    
    Or we could change it to a def here in order to avoid methods-in-methods.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83126/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83766/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    Python API jira created here: https://issues.apache.org/jira/browse/SPARK-22521


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/19588
  
    @WeichenXu123 when you create the JIRA for Python, can you please link it to this task's JIRA?  Thanks!
    
    LGTM
    Merging with master


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org