You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/03/10 13:28:42 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #40367: [SPARK-42747][ML] Fix incorrect internal status of LoR and AFT

zhengruifeng opened a new pull request, #40367:
URL: https://github.com/apache/spark/pull/40367

   ### What changes were proposed in this pull request?
   Add a hook `onParamChange` in `Params.{set, setDefault, clear}`, so that subclass can update the internal status within it.
   
   ### Why are the changes needed?
   In 3.1, we added internal auxiliary variables in LoR and AFT to optimize prediction/transformation.
   In LoR, when users call `model.{setThreshold, setThresholds}`, the internal status will be correctly updated.
   
   But users still can call `model.set(model.threshold, value)`, then the status will not be updated.
   And when users call `model.clear(model.threshold)`, the status should be updated with default threshold value 0.5.
   
   for example:
   ```
   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.classification._
   
   val df = Seq((1.0, 1.0, Vectors.dense(0.0, 5.0)), (0.0, 2.0, Vectors.dense(1.0, 2.0)), (1.0, 3.0, Vectors.dense(2.0, 1.0)), (0.0, 4.0, Vectors.dense(3.0, 3.0))).toDF("label", "weight", "features")
   
   val lor = new LogisticRegression().setWeightCol("weight")
   val model = lor.fit(df)
   
   val vec = Vectors.dense(0.0, 5.0)
   
   val p0 = model.predict(vec)                                               # return 0.0
   model.setThreshold(0.05)                                                 # change status
   val p1 = model.set(model.threshold, 0.5).predict(vec)   # return 1.0; but should be 0.0
   val p2 = model.clear(model.threshold).predict(vec)       # return 1.0; but should be 0.0
   ```
   
   what makes it even worse it that `pyspark.ml` always set params via `model.set(model.threshold, value)`, so the internal status is easily out of sync, see the example in [SPARK-42747](https://issues.apache.org/jira/browse/SPARK-42747)
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   added ut


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #40367: [SPARK-42747][ML] Fix incorrect internal status of LoR and AFT

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40367:
URL: https://github.com/apache/spark/pull/40367#issuecomment-1465353652

   @srowen Thank you for the reviews!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #40367: [SPARK-42747][ML] Fix incorrect internal status of LoR and AFT

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40367:
URL: https://github.com/apache/spark/pull/40367#discussion_r1132385670


##########
mllib/src/main/scala/org/apache/spark/ml/param/params.scala:
##########
@@ -879,15 +882,8 @@ trait Params extends Identifiable with Serializable {
     }
     to
   }
-}
 
-private[ml] object Params {
-  /**
-   * Sets a default param value for a `Params`.
-   */
-  private[ml] final def setDefault[T](params: Params, param: Param[T], value: T): Unit = {
-    params.defaultParamMap.put(param -> value)

Review Comment:
   should avoid directly updating `defaultParamMap`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen closed pull request #40367: [SPARK-42747][ML] Fix incorrect internal status of LoR and AFT

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen closed pull request #40367: [SPARK-42747][ML] Fix incorrect internal status of LoR and AFT
URL: https://github.com/apache/spark/pull/40367


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #40367: [SPARK-42747][ML] Fix incorrect internal status of LoR and AFT

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40367:
URL: https://github.com/apache/spark/pull/40367#issuecomment-1463813204

   cc @WeichenXu123 @srowen @huaxingao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #40367: [SPARK-42747][ML] Fix incorrect internal status of LoR and AFT

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #40367:
URL: https://github.com/apache/spark/pull/40367#issuecomment-1464926814

   Merged to master/3.4/3.3/3.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org