Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/20 06:56:45 UTC

[GitHub] [spark] uchiiii opened a new pull request, #36920: [MLLIB] Modify constructor of RankingMetrics class

uchiiii opened a new pull request, #36920:
URL: https://github.com/apache/spark/pull/36920

   
   ### What changes were proposed in this pull request?
   - Merged the two constructors into one using `RDD[_ <: Product]`.
   
   
   ### Why are the changes needed?
   - To make the code simpler.
   - To support more kinds of input.
   - The previous code treats `rel` as an empty array when `rel` is not provided, which is inelegant. This change removes that workaround.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] uchiiii commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
uchiiii commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r903175108


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -154,54 +151,68 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def ndcgAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, rel) =>
-      val useBinary = rel.isEmpty
-      val labSet = lab.toSet
-      val relMap = lab.zip(rel).toMap
-      if (useBinary && lab.size != rel.size) {
-        logWarning(
-          "# of ground truth set and # of relevance value set should be equal, " +
-            "check input data")
-      }
-
-      if (labSet.nonEmpty) {
-        val labSetSize = labSet.size
-        val n = math.min(math.max(pred.length, labSetSize), k)
-        var maxDcg = 0.0
-        var dcg = 0.0
-        var i = 0
-        while (i < n) {
-          if (useBinary) {
-            // Base of the log doesn't matter for calculating NDCG,
-            // if the relevance value is binary.
-            val gain = 1.0 / math.log(i + 2)
-            if (i < pred.length && labSet.contains(pred(i))) {
-              dcg += gain
-            }
-            if (i < labSetSize) {
-              maxDcg += gain
+    predictionAndLabels
+      .map {
+        case (pred: Array[T], lab: Array[T]) =>

Review Comment:
   @zhengruifeng
   Sorry to interrupt you, but which do you think is better? 
   





[GitHub] [spark] srowen closed pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
srowen closed pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class
URL: https://github.com/apache/spark/pull/36920




[GitHub] [spark] uchiiii commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
uchiiii commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r903175108


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -154,54 +151,68 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def ndcgAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, rel) =>
-      val useBinary = rel.isEmpty
-      val labSet = lab.toSet
-      val relMap = lab.zip(rel).toMap
-      if (useBinary && lab.size != rel.size) {
-        logWarning(
-          "# of ground truth set and # of relevance value set should be equal, " +
-            "check input data")
-      }
-
-      if (labSet.nonEmpty) {
-        val labSetSize = labSet.size
-        val n = math.min(math.max(pred.length, labSetSize), k)
-        var maxDcg = 0.0
-        var dcg = 0.0
-        var i = 0
-        while (i < n) {
-          if (useBinary) {
-            // Base of the log doesn't matter for calculating NDCG,
-            // if the relevance value is binary.
-            val gain = 1.0 / math.log(i + 2)
-            if (i < pred.length && labSet.contains(pred(i))) {
-              dcg += gain
-            }
-            if (i < labSetSize) {
-              maxDcg += gain
+    predictionAndLabels
+      .map {
+        case (pred: Array[T], lab: Array[T]) =>

Review Comment:
   @zhengruifeng
   Sorry for interrupting you, but which do you think is better? 
   





[GitHub] [spark] uchiiii commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
uchiiii commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r901533321


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -70,9 +62,14 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def precisionAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, _) =>
-      countRelevantItemRatio(pred, lab, k, k)
-    }.mean()
+    predictionAndLabels

Review Comment:
   Do you mean that we use `rdd` as a private variable whose type is `RDD[(Array[T], Array[T], Array[Double])]`, and keep the other methods almost the same?








[GitHub] [spark] uchiiii commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
uchiiii commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r902047346


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -154,54 +151,68 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def ndcgAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, rel) =>
-      val useBinary = rel.isEmpty
-      val labSet = lab.toSet
-      val relMap = lab.zip(rel).toMap
-      if (useBinary && lab.size != rel.size) {
-        logWarning(
-          "# of ground truth set and # of relevance value set should be equal, " +
-            "check input data")
-      }
-
-      if (labSet.nonEmpty) {
-        val labSetSize = labSet.size
-        val n = math.min(math.max(pred.length, labSetSize), k)
-        var maxDcg = 0.0
-        var dcg = 0.0
-        var i = 0
-        while (i < n) {
-          if (useBinary) {
-            // Base of the log doesn't matter for calculating NDCG,
-            // if the relevance value is binary.
-            val gain = 1.0 / math.log(i + 2)
-            if (i < pred.length && labSet.contains(pred(i))) {
-              dcg += gain
-            }
-            if (i < labSetSize) {
-              maxDcg += gain
+    predictionAndLabels
+      .map {
+        case (pred: Array[T], lab: Array[T]) =>

Review Comment:
   Which do you think is better for `ndcgAt`?
   - The previous one, where binary relevance is used whenever `rel` is an empty array.
   - The current one, where the choice of binary relevance is made directly from the user input.
   
   IMHO, the current one is easier to understand.
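
For readers comparing the two options, here is a minimal single-query sketch of the previous approach, in which an empty `rel` selects binary relevance. It is not the Spark implementation: `NdcgSketch` is a hypothetical name, plain arrays of `Int` stand in for the `RDD` rows, and the graded branch (gain `2^rel - 1`, ideal ranking taken as `rel` sorted in descending order) is an assumption, since the diff above only shows the binary branch.

```scala
object NdcgSketch {
  // NDCG@k for one query. An empty `rel` selects binary relevance,
  // mirroring the `useBinary = rel.isEmpty` check in the removed lines above.
  def ndcgAt(pred: Array[Int], lab: Array[Int], rel: Array[Double], k: Int): Double = {
    require(k > 0, "ranking position k should be positive")
    val useBinary = rel.isEmpty
    val labSet = lab.toSet
    if (labSet.isEmpty) {
      0.0
    } else {
      // Assumption: graded gain is 2^rel - 1, and the ideal ranking
      // is `rel` sorted in descending order.
      val relMap = lab.zip(rel).toMap
      val ideal = rel.sortWith(_ > _)
      val n = math.min(math.max(pred.length, labSet.size), k)
      var dcg = 0.0
      var maxDcg = 0.0
      var i = 0
      while (i < n) {
        // The base of the log cancels out in the dcg / maxDcg ratio.
        val discount = 1.0 / math.log(i + 2)
        if (useBinary) {
          if (i < pred.length && labSet.contains(pred(i))) dcg += discount
          if (i < labSet.size) maxDcg += discount
        } else {
          if (i < pred.length) {
            dcg += (math.pow(2.0, relMap.getOrElse(pred(i), 0.0)) - 1) * discount
          }
          if (i < ideal.length) {
            maxDcg += (math.pow(2.0, ideal(i)) - 1) * discount
          }
        }
        i += 1
      }
      if (maxDcg == 0.0) 0.0 else dcg / maxDcg
    }
  }

  def main(args: Array[String]): Unit = {
    // Binary relevance: no `rel` values supplied.
    println(ndcgAt(Array(3, 1), Array(1), Array.empty[Double], k = 2))
    // Graded relevance: one value per ground-truth label.
    println(ndcgAt(Array(1, 2), Array(1, 2), Array(1.0, 3.0), k = 2))
  }
}
```

In this form the branch is chosen from the data (`rel.isEmpty`) rather than from the shape of the input tuple, which is the trade-off the two bullet points describe.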





[GitHub] [spark] uchiiii commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
uchiiii commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r906175522


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -38,16 +38,14 @@ import org.apache.spark.rdd.RDD
  *                            Since 3.4.0, it supports ndcg evaluation with relevance value.
  */
 @Since("1.2.0")
-class RankingMetrics[T: ClassTag] @Since("3.4.0") (
-    predictionAndLabels: RDD[(Array[T], Array[T], Array[Double])])
+class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <: Product])
     extends Logging
     with Serializable {
 
-  @Since("1.2.0")
-  def this(predictionAndLabelsWithoutRelevance: => RDD[(Array[T], Array[T])]) = {
-    this(predictionAndLabelsWithoutRelevance.map {
-      case (pred, lab) => (pred, lab, Array.empty[Double])
-    })
+  private val rdd = predictionAndLabels.map {

Review Comment:
   How about `fullRDD`? 
   IMO, names like `predictionsAndLabelsAndRelevances` are too long.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r906621586


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -38,16 +38,14 @@ import org.apache.spark.rdd.RDD
  *                            Since 3.4.0, it supports ndcg evaluation with relevance value.
  */
 @Since("1.2.0")
-class RankingMetrics[T: ClassTag] @Since("3.4.0") (
-    predictionAndLabels: RDD[(Array[T], Array[T], Array[Double])])
+class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <: Product])
     extends Logging
     with Serializable {
 
-  @Since("1.2.0")
-  def this(predictionAndLabelsWithoutRelevance: => RDD[(Array[T], Array[T])]) = {
-    this(predictionAndLabelsWithoutRelevance.map {
-      case (pred, lab) => (pred, lab, Array.empty[Double])
-    })
+  private val rdd = predictionAndLabels.map {

Review Comment:
   maybe `predictionsLabelsRelevances`?





[GitHub] [spark] uchiiii commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
uchiiii commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r903607449


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -154,54 +151,68 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def ndcgAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, rel) =>
-      val useBinary = rel.isEmpty
-      val labSet = lab.toSet
-      val relMap = lab.zip(rel).toMap
-      if (useBinary && lab.size != rel.size) {
-        logWarning(
-          "# of ground truth set and # of relevance value set should be equal, " +
-            "check input data")
-      }
-
-      if (labSet.nonEmpty) {
-        val labSetSize = labSet.size
-        val n = math.min(math.max(pred.length, labSetSize), k)
-        var maxDcg = 0.0
-        var dcg = 0.0
-        var i = 0
-        while (i < n) {
-          if (useBinary) {
-            // Base of the log doesn't matter for calculating NDCG,
-            // if the relevance value is binary.
-            val gain = 1.0 / math.log(i + 2)
-            if (i < pred.length && labSet.contains(pred(i))) {
-              dcg += gain
-            }
-            if (i < labSetSize) {
-              maxDcg += gain
+    predictionAndLabels
+      .map {
+        case (pred: Array[T], lab: Array[T]) =>

Review Comment:
   Thank you for your opinion.
   
   Hm, I thought the current one might be more concise for developers, since it makes it easy to see that the calculation differs by input type (even though I wrote both of them).
   
   Anyway, I changed this back to the previous one.





[GitHub] [spark] uchiiii commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
uchiiii commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r903171942


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -38,16 +38,14 @@ import org.apache.spark.rdd.RDD
  *                            Since 3.4.0, it supports ndcg evaluation with relevance value.
  */
 @Since("1.2.0")
-class RankingMetrics[T: ClassTag] @Since("3.4.0") (
-    predictionAndLabels: RDD[(Array[T], Array[T], Array[Double])])
+class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <: Product])
     extends Logging
     with Serializable {
 
-  @Since("1.2.0")
-  def this(predictionAndLabelsWithoutRelevance: => RDD[(Array[T], Array[T])]) = {
-    this(predictionAndLabelsWithoutRelevance.map {
-      case (pred, lab) => (pred, lab, Array.empty[Double])
-    })
+  private val rdd = predictionAndLabels.map {

Review Comment:
   We may not want to change the name, because:
   - `MulticlassMetrics` also has an argument named `predictionAndLabels`.
   https://github.com/apache/spark/blob/b588d070ebc234280af730c8f9915e8859b1886e/mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala#L28-L35
   - We are changing the constructor's interface so that it can support more inputs. A specific name like `predictionAndLabelsWithOptionalRelevance` may work against that goal.





[GitHub] [spark] srowen commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r902609854


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -38,16 +38,14 @@ import org.apache.spark.rdd.RDD
  *                            Since 3.4.0, it supports ndcg evaluation with relevance value.
  */
 @Since("1.2.0")
-class RankingMetrics[T: ClassTag] @Since("3.4.0") (
-    predictionAndLabels: RDD[(Array[T], Array[T], Array[Double])])
+class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <: Product])
     extends Logging
     with Serializable {
 
-  @Since("1.2.0")
-  def this(predictionAndLabelsWithoutRelevance: => RDD[(Array[T], Array[T])]) = {
-    this(predictionAndLabelsWithoutRelevance.map {
-      case (pred, lab) => (pred, lab, Array.empty[Double])
-    })
+  private val rdd = predictionAndLabels.map {

Review Comment:
   Maybe we can rename this and the constructor arg; the constructor arg may also have relevance; this does not





[GitHub] [spark] uchiiii commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
uchiiii commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r902047346


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -154,54 +151,68 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def ndcgAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, rel) =>
-      val useBinary = rel.isEmpty
-      val labSet = lab.toSet
-      val relMap = lab.zip(rel).toMap
-      if (useBinary && lab.size != rel.size) {
-        logWarning(
-          "# of ground truth set and # of relevance value set should be equal, " +
-            "check input data")
-      }
-
-      if (labSet.nonEmpty) {
-        val labSetSize = labSet.size
-        val n = math.min(math.max(pred.length, labSetSize), k)
-        var maxDcg = 0.0
-        var dcg = 0.0
-        var i = 0
-        while (i < n) {
-          if (useBinary) {
-            // Base of the log doesn't matter for calculating NDCG,
-            // if the relevance value is binary.
-            val gain = 1.0 / math.log(i + 2)
-            if (i < pred.length && labSet.contains(pred(i))) {
-              dcg += gain
-            }
-            if (i < labSetSize) {
-              maxDcg += gain
+    predictionAndLabels
+      .map {
+        case (pred: Array[T], lab: Array[T]) =>

Review Comment:
   Which do you think is better for `ndcgAt`?
   - The previous one, where `binary` is used whenever `rel` is an empty array.
   - The current one, where the choice of `binary` is made directly from the user input.





[GitHub] [spark] srowen commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r906167565


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -38,16 +38,14 @@ import org.apache.spark.rdd.RDD
  *                            Since 3.4.0, it supports ndcg evaluation with relevance value.
  */
 @Since("1.2.0")
-class RankingMetrics[T: ClassTag] @Since("3.4.0") (
-    predictionAndLabels: RDD[(Array[T], Array[T], Array[Double])])
+class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <: Product])
     extends Logging
     with Serializable {
 
-  @Since("1.2.0")
-  def this(predictionAndLabelsWithoutRelevance: => RDD[(Array[T], Array[T])]) = {
-    this(predictionAndLabelsWithoutRelevance.map {
-      case (pred, lab) => (pred, lab, Array.empty[Double])
-    })
+  private val rdd = predictionAndLabels.map {

Review Comment:
   WDYT @uchiiii ?





[GitHub] [spark] uchiiii commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
uchiiii commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r903175108


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -154,54 +151,68 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def ndcgAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, rel) =>
-      val useBinary = rel.isEmpty
-      val labSet = lab.toSet
-      val relMap = lab.zip(rel).toMap
-      if (useBinary && lab.size != rel.size) {
-        logWarning(
-          "# of ground truth set and # of relevance value set should be equal, " +
-            "check input data")
-      }
-
-      if (labSet.nonEmpty) {
-        val labSetSize = labSet.size
-        val n = math.min(math.max(pred.length, labSetSize), k)
-        var maxDcg = 0.0
-        var dcg = 0.0
-        var i = 0
-        while (i < n) {
-          if (useBinary) {
-            // Base of the log doesn't matter for calculating NDCG,
-            // if the relevance value is binary.
-            val gain = 1.0 / math.log(i + 2)
-            if (i < pred.length && labSet.contains(pred(i))) {
-              dcg += gain
-            }
-            if (i < labSetSize) {
-              maxDcg += gain
+    predictionAndLabels
+      .map {
+        case (pred: Array[T], lab: Array[T]) =>

Review Comment:
   @zhengruifeng @srowen 
   Sorry to interrupt you, but which do you think is better? 
   





[GitHub] [spark] zhengruifeng commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r901346286


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -70,9 +62,14 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def precisionAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, _) =>
-      countRelevantItemRatio(pred, lab, k, k)
-    }.mean()
+    predictionAndLabels

Review Comment:
   I think we could create a private rdd, and then use it internally to minimize changes.
   
   something like this:
   ```
   private val rdd = predictionAndLabels.map {
       case (pred: Array[T], lab: Array[T]) => ...
       case (pred: Array[T], lab: Array[T], rel: Array[Double]) => ...
   }
   
   ```
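
The suggestion above leaves both case bodies elided. As an illustration only, the normalization step might look like the following sketch. It makes several assumptions: a local `Seq[Product]` stands in for the `RDD`, the concrete element type `Int` stands in for `T` so that the type patterns can be checked at runtime, and `NormalizeRows`, `Row` and `normalize` are hypothetical names that do not appear in the PR.

```scala
object NormalizeRows {
  // Internal shape shared by all metrics: (predictions, labels, relevance values).
  type Row = (Array[Int], Array[Int], Array[Double])

  // Map 2-tuples and 3-tuples onto the single internal shape.
  def normalize(rows: Seq[Product]): Seq[Row] = rows.map {
    case (pred: Array[Int], lab: Array[Int]) =>
      // No relevance supplied: represent it as an empty array.
      (pred, lab, Array.empty[Double])
    case (pred: Array[Int], lab: Array[Int], rel: Array[Double]) =>
      (pred, lab, rel)
    case other =>
      // Assumption: any other shape is rejected rather than silently skipped.
      throw new IllegalArgumentException(
        s"Unsupported row of arity ${other.productArity}")
  }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Product] = Seq(
      (Array(1, 2, 3), Array(1, 3)),
      (Array(4, 5), Array(5), Array(2.0))
    )
    normalize(rows).foreach { case (pred, lab, rel) =>
      println(s"pred=${pred.mkString(",")} lab=${lab.mkString(",")} rel=${rel.mkString(",")}")
    }
  }
}
```

With every row in one shape, methods such as `precisionAt` can keep matching on a triple and ignore the third element, which is how this approach keeps the changes small.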





[GitHub] [spark] zhengruifeng commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r901586489


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -70,9 +62,14 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def precisionAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, _) =>
-      countRelevantItemRatio(pred, lab, k, k)
-    }.mean()
+    predictionAndLabels

Review Comment:
   Yes, I hope this can reduce the number of modifications.





[GitHub] [spark] uchiiii commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
uchiiii commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r902047346


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -154,54 +151,68 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def ndcgAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, rel) =>
-      val useBinary = rel.isEmpty
-      val labSet = lab.toSet
-      val relMap = lab.zip(rel).toMap
-      if (useBinary && lab.size != rel.size) {
-        logWarning(
-          "# of ground truth set and # of relevance value set should be equal, " +
-            "check input data")
-      }
-
-      if (labSet.nonEmpty) {
-        val labSetSize = labSet.size
-        val n = math.min(math.max(pred.length, labSetSize), k)
-        var maxDcg = 0.0
-        var dcg = 0.0
-        var i = 0
-        while (i < n) {
-          if (useBinary) {
-            // Base of the log doesn't matter for calculating NDCG,
-            // if the relevance value is binary.
-            val gain = 1.0 / math.log(i + 2)
-            if (i < pred.length && labSet.contains(pred(i))) {
-              dcg += gain
-            }
-            if (i < labSetSize) {
-              maxDcg += gain
+    predictionAndLabels
+      .map {
+        case (pred: Array[T], lab: Array[T]) =>

Review Comment:
   Which do you think is better for `ndcgAt`?
   - The previous one, where whether to use binary relevance is decided by whether `rel` is an empty array.
   - The current one, where whether to use binary relevance is decided directly by the type of the user input.





[GitHub] [spark] uchiiii commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
uchiiii commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r903607449


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -154,54 +151,68 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def ndcgAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, rel) =>
-      val useBinary = rel.isEmpty
-      val labSet = lab.toSet
-      val relMap = lab.zip(rel).toMap
-      if (useBinary && lab.size != rel.size) {
-        logWarning(
-          "# of ground truth set and # of relevance value set should be equal, " +
-            "check input data")
-      }
-
-      if (labSet.nonEmpty) {
-        val labSetSize = labSet.size
-        val n = math.min(math.max(pred.length, labSetSize), k)
-        var maxDcg = 0.0
-        var dcg = 0.0
-        var i = 0
-        while (i < n) {
-          if (useBinary) {
-            // Base of the log doesn't matter for calculating NDCG,
-            // if the relevance value is binary.
-            val gain = 1.0 / math.log(i + 2)
-            if (i < pred.length && labSet.contains(pred(i))) {
-              dcg += gain
-            }
-            if (i < labSetSize) {
-              maxDcg += gain
+    predictionAndLabels
+      .map {
+        case (pred: Array[T], lab: Array[T]) =>

Review Comment:
   Thank you for your opinion.
   
   Hm, I thought the current one might be clearer for developers, because it makes it obvious that the calculation differs by input type (even though I wrote both of them).
   
   Anyway, I changed this back to the previous one.
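
For readers following the thread, here is a rough sketch of the "previous" approach for a single query, where binary relevance is chosen when `rel` is empty. The binary branch follows the lines visible in the quoted diff. The graded branch is an assumption: it uses the common exponential gain `2^rel - 1` and builds the ideal ordering by sorting the relevance values in descending order, which may differ from the actual Spark implementation. The names `NdcgSketch` and `idealRel` are hypothetical, and the item type is fixed to `String` for simplicity.

```scala
object NdcgSketch {
  // NDCG@k for one query. Binary relevance is used when `rel` is empty,
  // mirroring the `useBinary = rel.isEmpty` check in the quoted diff.
  def ndcgAt(pred: Array[String], lab: Array[String], rel: Array[Double], k: Int): Double = {
    require(k > 0, "ranking position k should be positive")
    val useBinary = rel.isEmpty
    val labSet = lab.toSet
    if (labSet.isEmpty) {
      0.0 // no ground truth: report 0.0
    } else {
      val relMap = lab.zip(rel).toMap
      // Assumption: the ideal ranking orders items by descending relevance.
      val idealRel = rel.sortWith(_ > _)
      val n = math.min(math.max(pred.length, labSet.size), k)
      var dcg = 0.0
      var maxDcg = 0.0
      var i = 0
      while (i < n) {
        // The log base does not matter because it cancels in dcg / maxDcg.
        val discount = 1.0 / math.log(i + 2)
        if (useBinary) {
          if (i < pred.length && labSet.contains(pred(i))) dcg += discount
          if (i < labSet.size) maxDcg += discount
        } else {
          if (i < pred.length) {
            dcg += (math.pow(2.0, relMap.getOrElse(pred(i), 0.0)) - 1.0) * discount
          }
          if (i < idealRel.length) {
            maxDcg += (math.pow(2.0, idealRel(i)) - 1.0) * discount
          }
        }
        i += 1
      }
      if (maxDcg == 0.0) 0.0 else dcg / maxDcg
    }
  }
}
```

The single `useBinary` flag keeps one code path for both input shapes, at the cost of relying on the empty-array convention instead of the input type.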





[GitHub] [spark] srowen commented on pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36920:
URL: https://github.com/apache/spark/pull/36920#issuecomment-1166347050

   Merged to master




[GitHub] [spark] zhengruifeng commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r903179811


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -154,54 +151,68 @@ class RankingMetrics[T: ClassTag] @Since("3.4.0") (
   @Since("1.2.0")
   def ndcgAt(k: Int): Double = {
     require(k > 0, "ranking position k should be positive")
-    predictionAndLabels.map { case (pred, lab, rel) =>
-      val useBinary = rel.isEmpty
-      val labSet = lab.toSet
-      val relMap = lab.zip(rel).toMap
-      if (useBinary && lab.size != rel.size) {
-        logWarning(
-          "# of ground truth set and # of relevance value set should be equal, " +
-            "check input data")
-      }
-
-      if (labSet.nonEmpty) {
-        val labSetSize = labSet.size
-        val n = math.min(math.max(pred.length, labSetSize), k)
-        var maxDcg = 0.0
-        var dcg = 0.0
-        var i = 0
-        while (i < n) {
-          if (useBinary) {
-            // Base of the log doesn't matter for calculating NDCG,
-            // if the relevance value is binary.
-            val gain = 1.0 / math.log(i + 2)
-            if (i < pred.length && labSet.contains(pred(i))) {
-              dcg += gain
-            }
-            if (i < labSetSize) {
-              maxDcg += gain
+    predictionAndLabels
+      .map {
+        case (pred: Array[T], lab: Array[T]) =>

Review Comment:
   I think the previous one may be more concise.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r903178123


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -38,16 +38,14 @@ import org.apache.spark.rdd.RDD
  *                            Since 3.4.0, it supports ndcg evaluation with relevance value.
  */
 @Since("1.2.0")
-class RankingMetrics[T: ClassTag] @Since("3.4.0") (
-    predictionAndLabels: RDD[(Array[T], Array[T], Array[Double])])
+class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <: Product])
     extends Logging
     with Serializable {
 
-  @Since("1.2.0")
-  def this(predictionAndLabelsWithoutRelevance: => RDD[(Array[T], Array[T])]) = {
-    this(predictionAndLabelsWithoutRelevance.map {
-      case (pred, lab) => (pred, lab, Array.empty[Double])
-    })
+  private val rdd = predictionAndLabels.map {

Review Comment:
   Why discard `rel` here?
   
   If `rdd` is an `RDD[(Array[T], Array[T], Array[Double])]`, then in `ndcgAt` we can simply check whether `rel` is empty.
   
   Also, maybe we can rename `rdd` to something more meaningful.





[GitHub] [spark] AmplabJenkins commented on pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36920:
URL: https://github.com/apache/spark/pull/36920#issuecomment-1160520565

   Can one of the admins verify this patch?




[GitHub] [spark] srowen commented on a diff in pull request #36920: [SPARK-39446][MLLIB][FOLLOWUP] Modify constructor of RankingMetrics class

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36920:
URL: https://github.com/apache/spark/pull/36920#discussion_r903172950


##########
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##########
@@ -38,16 +38,14 @@ import org.apache.spark.rdd.RDD
  *                            Since 3.4.0, it supports ndcg evaluation with relevance value.
  */
 @Since("1.2.0")
-class RankingMetrics[T: ClassTag] @Since("3.4.0") (
-    predictionAndLabels: RDD[(Array[T], Array[T], Array[Double])])
+class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <: Product])
     extends Logging
     with Serializable {
 
-  @Since("1.2.0")
-  def this(predictionAndLabelsWithoutRelevance: => RDD[(Array[T], Array[T])]) = {
-    this(predictionAndLabelsWithoutRelevance.map {
-      case (pred, lab) => (pred, lab, Array.empty[Double])
-    })
+  private val rdd = predictionAndLabels.map {

Review Comment:
   OK, that's reasonable.


