You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by gaborgsomogyi <gi...@git.apache.org> on 2018/01/23 12:13:11 UTC

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

GitHub user gaborgsomogyi opened a pull request:

    https://github.com/apache/spark/pull/20362

    [Spark-22886][ML][TESTS] ML test for structured streaming: ml.recomme…

    ## What changes were proposed in this pull request?
    
    Converting spark.ml.recommendation tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882.
    
    ## How was this patch tested?
    
    Automated: Pass the Jenkins.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gaborgsomogyi/spark SPARK-22886

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20362.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20362
    
----
commit 33654c93c2fe240eb0c6a6932353239ab84b0ce0
Author: Gabor Somogyi <ga...@...>
Date:   2018-01-18T20:27:08Z

    [Spark-22886][ML][TESTS] ML test for structured streaming: ml.recommendation

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20362: [Spark-22886][ML][TESTS] ML test for structured streamin...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20362
  
    **[Test build #4126 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4126/testReport)** for PR 20362 at commit [`acfb092`](https://github.com/apache/spark/commit/acfb0920d0c5ae4601580cec633f2c3b87a2673f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20362: [Spark-22886][ML][TESTS] ML test for structured streamin...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20362
  
    **[Test build #4126 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4126/testReport)** for PR 20362 at commit [`acfb092`](https://github.com/apache/spark/commit/acfb0920d0c5ae4601580cec633f2c3b87a2673f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165378573
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -566,6 +565,7 @@ class ALSSuite
       test("read/write") {
         val spark = this.spark
         import spark.implicits._
    +
    --- End diff --
    
    Fixed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r167005865
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -599,8 +598,11 @@ class ALSSuite
               (ex, act) =>
                 ex.userFactors.first().getSeq[Float](1) === act.userFactors.first.getSeq[Float](1)
             } { (ex, act, _) =>
    -          ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
    -            act.transform(_: DataFrame).select("prediction").first.getDouble(0) absTol 1e-6
    +          testTransformerByGlobalCheckFunc[Float](_: DataFrame, act, "prediction") {
    +            case actRows: Seq[Row] =>
    +              ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
    +                actRows(0).getDouble(0) absTol 1e-6
    +          }
    --- End diff --
    
    Woah, didn't check the original functionality in such a depth. This is really dead code in runtime environment. Fixed 👍 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20362: [Spark-22886][ML][TESTS] ML test for structured streamin...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20362
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165396480
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -662,28 +676,32 @@ class ALSSuite
         val knownItem = data.select(max("item")).as[Int].first()
         val unknownItem = knownItem + 20
         val test = Seq(
    -      (unknownUser, unknownItem),
    -      (knownUser, unknownItem),
    -      (unknownUser, knownItem),
    -      (knownUser, knownItem)
    -    ).toDF("user", "item")
    +      (unknownUser, unknownItem, true),
    +      (knownUser, unknownItem, true),
    +      (unknownUser, knownItem, true),
    +      (knownUser, knownItem, false)
    +    ).toDF("user", "item", "expectedIsNaN")
     
         val als = new ALS().setMaxIter(1).setRank(1)
         // default is 'nan'
         val defaultModel = als.fit(data)
    -    val defaultPredictions = defaultModel.transform(test).select("prediction").as[Float].collect()
    -    assert(defaultPredictions.length == 4)
    -    assert(defaultPredictions.slice(0, 3).forall(_.isNaN))
    -    assert(!defaultPredictions.last.isNaN)
    +    var defaultPredictionNotNaN = Float.NaN
    --- End diff --
    
    Fixed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by smurakozi <gi...@git.apache.org>.

Github user smurakozi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165312057
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -662,28 +676,32 @@ class ALSSuite
         val knownItem = data.select(max("item")).as[Int].first()
         val unknownItem = knownItem + 20
         val test = Seq(
    -      (unknownUser, unknownItem),
    -      (knownUser, unknownItem),
    -      (unknownUser, knownItem),
    -      (knownUser, knownItem)
    -    ).toDF("user", "item")
    +      (unknownUser, unknownItem, true),
    +      (knownUser, unknownItem, true),
    +      (unknownUser, knownItem, true),
    +      (knownUser, knownItem, false)
    +    ).toDF("user", "item", "expectedIsNaN")
     
         val als = new ALS().setMaxIter(1).setRank(1)
         // default is 'nan'
         val defaultModel = als.fit(data)
    -    val defaultPredictions = defaultModel.transform(test).select("prediction").as[Float].collect()
    -    assert(defaultPredictions.length == 4)
    -    assert(defaultPredictions.slice(0, 3).forall(_.isNaN))
    -    assert(!defaultPredictions.last.isNaN)
    +    var defaultPredictionNotNaN = Float.NaN
    --- End diff --
    
    I would get rid of this variable. 
    In `testTransformer` it just adds overhead, `assert(!defaultPredictionNotNaN.isNaN)` asserts something that was already checked in testTransformer, so it's only use is in `testTransformerByGlobalCheckFunc`.
    Producing it is a bit convoluted, it's not easy to understand why it's needed. 
    I would make it clearer by doing a plain old transform using the `test` DF (or a smaller one containing only the knownUser, knownItem pair) and selecting the value.
    An alternative solution could be to use real expected values in the `test` DF instead of "isNan" flags. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by smurakozi <gi...@git.apache.org>.

Github user smurakozi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165305989
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -599,8 +599,15 @@ class ALSSuite
               (ex, act) =>
                 ex.userFactors.first().getSeq[Float](1) === act.userFactors.first.getSeq[Float](1)
             } { (ex, act, _) =>
    -          ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
    -            act.transform(_: DataFrame).select("prediction").first.getDouble(0) absTol 1e-6
    +          testTransformerByGlobalCheckFunc[Float](_: DataFrame, ex, "prediction") {
    +            case exRows: Seq[Row] =>
    --- End diff --
    
    I think it's ok to keep ex.transform here. This way the code will be a bit simpler.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r170072578
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -413,34 +411,36 @@ class ALSSuite
           .setSeed(0)
         val alpha = als.getAlpha
         val model = als.fit(training.toDF())
    -    val predictions = model.transform(test.toDF()).select("rating", "prediction").rdd.map {
    -      case Row(rating: Float, prediction: Float) =>
    -        (rating.toDouble, prediction.toDouble)
    +    testTransformerByGlobalCheckFunc[Rating[Int]](test.toDF(), model, "rating", "prediction") {
    +        case rows: Seq[Row] =>
    +          val predictions = rows.map(row => (row.getFloat(0).toDouble, row.getFloat(1).toDouble))
    +
    +          val rmse =
    +            if (implicitPrefs) {
    +              // TODO: Use a better (rank-based?) evaluation metric for implicit feedback.
    +              // We limit the ratings and the predictions to interval [0, 1] and compute the
    +              // weighted RMSE with the confidence scores as weights.
    +              val (totalWeight, weightedSumSq) = predictions.map { case (rating, prediction) =>
    +                val confidence = 1.0 + alpha * math.abs(rating)
    +                val rating01 = math.max(math.min(rating, 1.0), 0.0)
    +                val prediction01 = math.max(math.min(prediction, 1.0), 0.0)
    +                val err = prediction01 - rating01
    +                (confidence, confidence * err * err)
    +              }.reduce[(Double, Double)] { case ((c0, e0), (c1, e1)) =>
    +                (c0 + c1, e0 + e1)
    +              }
    +              math.sqrt(weightedSumSq / totalWeight)
    +            } else {
    +              val errorSquares = predictions.map { case (rating, prediction) =>
    +                val err = rating - prediction
    +                err * err
    +              }
    +              val mse = errorSquares.sum / errorSquares.length
    +              math.sqrt(mse)
    +            }
    +          logInfo(s"Test RMSE is $rmse.")
    +          assert(rmse < targetRMSE)
         }
    -    val rmse =
    --- End diff --
    
    Mainly move but there was no mean function so implemented.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r170046788
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -586,6 +586,68 @@ class ALSSuite
           allModelParamSettings, checkModelData)
       }
     
    +  private def checkNumericTypesALS(
    --- End diff --
    
    Is this mostly just moving the code from the test class, or did it change too?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20362


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165379876
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -599,8 +599,15 @@ class ALSSuite
               (ex, act) =>
                 ex.userFactors.first().getSeq[Float](1) === act.userFactors.first.getSeq[Float](1)
             } { (ex, act, _) =>
    -          ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
    -            act.transform(_: DataFrame).select("prediction").first.getDouble(0) absTol 1e-6
    +          testTransformerByGlobalCheckFunc[Float](_: DataFrame, ex, "prediction") {
    +            case exRows: Seq[Row] =>
    --- End diff --
    
    Fixed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20362: [Spark-22886][ML][TESTS] ML test for structured streamin...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on the issue:

    https://github.com/apache/spark/pull/20362
  
    gentle ping @jkbradley @WeichenXu123


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20362: [Spark-22886][ML][TESTS] ML test for structured streamin...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on the issue:

    https://github.com/apache/spark/pull/20362
  
    ping @srowen 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by smurakozi <gi...@git.apache.org>.

Github user smurakozi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165308205
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -653,6 +666,7 @@ class ALSSuite
       test("ALS cold start user/item prediction strategy") {
         val spark = this.spark
         import spark.implicits._
    +
    --- End diff --
    
    nit: no need for empty line here


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20362: [Spark-22886][ML][TESTS] ML test for structured streamin...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on the issue:

    https://github.com/apache/spark/pull/20362
  
    cc @jkbradley @WeichenXu123 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by smurakozi <gi...@git.apache.org>.

Github user smurakozi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165304423
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -566,6 +565,7 @@ class ALSSuite
       test("read/write") {
         val spark = this.spark
         import spark.implicits._
    +
    --- End diff --
    
    nit: new line is not needed


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20362: [Spark-22886][ML][TESTS] ML test for structured streamin...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20362
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by smurakozi <gi...@git.apache.org>.

Github user smurakozi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165745445
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -599,8 +598,11 @@ class ALSSuite
               (ex, act) =>
                 ex.userFactors.first().getSeq[Float](1) === act.userFactors.first.getSeq[Float](1)
             } { (ex, act, _) =>
    -          ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
    -            act.transform(_: DataFrame).select("prediction").first.getDouble(0) absTol 1e-6
    +          testTransformerByGlobalCheckFunc[Float](_: DataFrame, act, "prediction") {
    +            case actRows: Seq[Row] =>
    +              ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
    +                actRows(0).getDouble(0) absTol 1e-6
    +          }
    --- End diff --
    
    I think this code does not check anything. 
    `testTransformerByGlobalCheckFunc[Float](_: DataFrame, act, "prediction")` is just a partial application of `testTransformerByGlobalCheckFunc`. 
    However,  `checkNumericTypesALS` expects `check2: (ALSModel, ALSModel, DataFrame) => Unit`. It's happy to call the provided function, discard the partially applied function and use `()` instead, so it will typecheck.
    As a consequence, the function doing the assert is never called, so the `~===` assertion never happens. You can check it say by asking for the 100th column of the first row - it will not produce an error.
    
    This problem is not a result of your change, the original code had the same issue.
    
    It could probably be simplified a bit but I think the original intent was to do a check like this:
    
    ```
    { (ex, act, df) =>
              ex.transform(df).selectExpr("cast(prediction as double)").first.getDouble(0) ~==
                act.transform(df).selectExpr("cast(prediction as double)").first.getDouble(0) absTol
                  1e-6
            }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by smurakozi <gi...@git.apache.org>.

Github user smurakozi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165308129
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -628,18 +635,24 @@ class ALSSuite
         }
         withClue("transform should fail when ids exceed integer range. ") {
           val model = als.fit(df)
    -      assert(intercept[SparkException] {
    -        model.transform(df.select(df("user_big").as("user"), df("item"))).first
    -      }.getMessage.contains(msg))
    -      assert(intercept[SparkException] {
    -        model.transform(df.select(df("user_small").as("user"), df("item"))).first
    -      }.getMessage.contains(msg))
    -      assert(intercept[SparkException] {
    -        model.transform(df.select(df("item_big").as("item"), df("user"))).first
    -      }.getMessage.contains(msg))
    -      assert(intercept[SparkException] {
    -        model.transform(df.select(df("item_small").as("item"), df("user"))).first
    -      }.getMessage.contains(msg))
    +      def testTransformIdExceedsIntRange[A : Encoder](dataFrame: DataFrame): Unit = {
    +        assert(intercept[SparkException] {
    +          model.transform(dataFrame).first
    +        }.getMessage.contains(msg))
    +        assert(intercept[StreamingQueryException] {
    +          testTransformer[A](dataFrame, model, "prediction") {
    +            case _ =>
    --- End diff --
    
    No need for a partial function here, you can simplify it to `{ _ => }`. 
    I would also add a small comment to make it explicit that we intentionally do not check anything.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165392031
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -693,7 +711,9 @@ class ALSSuite
         val data = ratings.toDF
         val model = new ALS().fit(data)
         Seq("nan", "NaN", "Nan", "drop", "DROP", "Drop").foreach { s =>
    -      model.setColdStartStrategy(s).transform(data)
    +      testTransformer[Rating[Int]](data, model.setColdStartStrategy(s), "prediction") {
    +        case _ =>
    --- End diff --
    
    Fixed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20362: [Spark-22886][ML][TESTS] ML test for structured streamin...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on the issue:

    https://github.com/apache/spark/pull/20362
  
    @smurakozi fix added for the original test issue.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165382316
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -628,18 +635,24 @@ class ALSSuite
         }
         withClue("transform should fail when ids exceed integer range. ") {
           val model = als.fit(df)
    -      assert(intercept[SparkException] {
    -        model.transform(df.select(df("user_big").as("user"), df("item"))).first
    -      }.getMessage.contains(msg))
    -      assert(intercept[SparkException] {
    -        model.transform(df.select(df("user_small").as("user"), df("item"))).first
    -      }.getMessage.contains(msg))
    -      assert(intercept[SparkException] {
    -        model.transform(df.select(df("item_big").as("item"), df("user"))).first
    -      }.getMessage.contains(msg))
    -      assert(intercept[SparkException] {
    -        model.transform(df.select(df("item_small").as("item"), df("user"))).first
    -      }.getMessage.contains(msg))
    +      def testTransformIdExceedsIntRange[A : Encoder](dataFrame: DataFrame): Unit = {
    +        assert(intercept[SparkException] {
    +          model.transform(dataFrame).first
    +        }.getMessage.contains(msg))
    +        assert(intercept[StreamingQueryException] {
    +          testTransformer[A](dataFrame, model, "prediction") {
    +            case _ =>
    --- End diff --
    
    Partial function removed. This code part expects `StreamingQueryException` which is quite close to this area. Not sure whether a comment would make it better.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by smurakozi <gi...@git.apache.org>.

Github user smurakozi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165312157
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -693,7 +711,9 @@ class ALSSuite
         val data = ratings.toDF
         val model = new ALS().fit(data)
         Seq("nan", "NaN", "Nan", "drop", "DROP", "Drop").foreach { s =>
    -      model.setColdStartStrategy(s).transform(data)
    +      testTransformer[Rating[Int]](data, model.setColdStartStrategy(s), "prediction") {
    +        case _ =>
    --- End diff --
    
    Just like above, no need for partial function.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r170047180
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -693,7 +766,7 @@ class ALSSuite
         val data = ratings.toDF
         val model = new ALS().fit(data)
         Seq("nan", "NaN", "Nan", "drop", "DROP", "Drop").foreach { s =>
    -      model.setColdStartStrategy(s).transform(data)
    +      testTransformer[Rating[Int]](data, model.setColdStartStrategy(s), "prediction") { _ => }
    --- End diff --
    
    What's the no-op function at the end for -- just because it requires an argument?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r165382596
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -653,6 +666,7 @@ class ALSSuite
       test("ALS cold start user/item prediction strategy") {
         val spark = this.spark
         import spark.implicits._
    +
    --- End diff --
    
    Fixed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20362: [Spark-22886][ML][TESTS] ML test for structured streamin...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/20362
  
    Nice work! 👍


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r170074167
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -586,6 +586,68 @@ class ALSSuite
           allModelParamSettings, checkModelData)
       }
     
    +  private def checkNumericTypesALS(
    --- End diff --
    
    It's changed because one of the test hasn't done anything. Please take a look at the last commit it contains a test bugfix not only the transformation.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r170046857
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -413,34 +411,36 @@ class ALSSuite
           .setSeed(0)
         val alpha = als.getAlpha
         val model = als.fit(training.toDF())
    -    val predictions = model.transform(test.toDF()).select("rating", "prediction").rdd.map {
    -      case Row(rating: Float, prediction: Float) =>
    -        (rating.toDouble, prediction.toDouble)
    +    testTransformerByGlobalCheckFunc[Rating[Int]](test.toDF(), model, "rating", "prediction") {
    +        case rows: Seq[Row] =>
    +          val predictions = rows.map(row => (row.getFloat(0).toDouble, row.getFloat(1).toDouble))
    +
    +          val rmse =
    +            if (implicitPrefs) {
    +              // TODO: Use a better (rank-based?) evaluation metric for implicit feedback.
    +              // We limit the ratings and the predictions to interval [0, 1] and compute the
    +              // weighted RMSE with the confidence scores as weights.
    +              val (totalWeight, weightedSumSq) = predictions.map { case (rating, prediction) =>
    +                val confidence = 1.0 + alpha * math.abs(rating)
    +                val rating01 = math.max(math.min(rating, 1.0), 0.0)
    +                val prediction01 = math.max(math.min(prediction, 1.0), 0.0)
    +                val err = prediction01 - rating01
    +                (confidence, confidence * err * err)
    +              }.reduce[(Double, Double)] { case ((c0, e0), (c1, e1)) =>
    +                (c0 + c1, e0 + e1)
    +              }
    +              math.sqrt(weightedSumSq / totalWeight)
    +            } else {
    +              val errorSquares = predictions.map { case (rating, prediction) =>
    +                val err = rating - prediction
    +                err * err
    +              }
    +              val mse = errorSquares.sum / errorSquares.length
    +              math.sqrt(mse)
    +            }
    +          logInfo(s"Test RMSE is $rmse.")
    +          assert(rmse < targetRMSE)
         }
    -    val rmse =
    --- End diff --
    
    This change is just a move, really? or did something else change as well?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...

Posted by gaborgsomogyi <gi...@git.apache.org>.

Github user gaborgsomogyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20362#discussion_r170072392
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -586,6 +586,68 @@ class ALSSuite
           allModelParamSettings, checkModelData)
       }
     
    +  private def checkNumericTypesALS(
    --- End diff --
    
    Mainly move but there was no mean function so implemented this oneliner.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20362: [Spark-22886][ML][TESTS] ML test for structured streamin...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20362
  
    Merged to master


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org