Posted to reviews@spark.apache.org by coderxiang <gi...@git.apache.org> on 2014/10/06 07:34:35 UTC

[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

GitHub user coderxiang opened a pull request:

    https://github.com/apache/spark/pull/2667

    SPARK-3568 [mllib] add ranking metrics

    Add common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including:
     - Mean Average Precision
     - Precision@n: top-n precision
     - Discounted cumulative gain (DCG) and NDCG 
    
    The following methods and the corresponding tests are implemented:
    
    ```
    class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
      /* Returns the precision@k for each query */
      lazy val precAtK: RDD[Array[Double]]
    
      /* Returns the average precision for each query */
      lazy val avePrec: RDD[Double]
    
      /* Returns the mean average precision (MAP) of all the queries */
      lazy val meanAvePrec: Double
    
      /* Returns the normalized discounted cumulative gain for each query */
      lazy val ndcg: RDD[Double]
    
      /* Returns the mean NDCG of all the queries */
      lazy val meanNdcg: Double
    }
    ```
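
    A minimal usage sketch, assuming the constructor and member names in the
    outline above (they may still change during review); the queries and item
    ids below are made up for illustration:

    ```
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.evaluation.RankingMetrics

    val sc = new SparkContext(new SparkConf().setAppName("RankingMetricsExample").setMaster("local"))

    // One record per query: (predicted ranking of item ids, ground truth item ids).
    val predictionAndLabels = sc.parallelize(Seq(
      (Array(1.0, 6.0, 2.0, 7.0, 8.0), Array(1.0, 2.0, 3.0, 4.0, 5.0)),
      (Array(4.0, 1.0, 5.0, 6.0, 2.0), Array(1.0, 2.0, 3.0))))

    val metrics = new RankingMetrics(predictionAndLabels)
    println(metrics.meanAvePrec)  // mean average precision (MAP) over all queries
    println(metrics.meanNdcg)     // mean NDCG over all queries
    ```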


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/coderxiang/spark rankingmetrics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2667.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2667
    
----
commit 3a5a6ffdb036f8432911184920193a4b8a007084
Author: coderxiang <sh...@gmail.com>
Date:   2014-10-06T05:28:05Z

    add ranking metrics

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by coderxiang <gi...@git.apache.org>.
Github user coderxiang commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59304592
  
    Jenkins, test this please


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19046940
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results,
    --- End diff --
    
    What if the label set contains fewer than k items? It is worth documenting.
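
    (A concrete case: under the #(relevant items in the top k) / k convention
    that a later revision of this patch documents, a ground truth set {a, b}
    with k = 5 scores at most 2 / 5 = 0.4 even for a perfect ranking, so
    precision@5 can never reach 1.0 when the label set has fewer than 5 items.)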


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59808963
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21920/consoleFull) for   PR 2667 at commit [`d64c120`](https://github.com/apache/spark/commit/d64c1201e439d2894a76196659c59b9abb03be5e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])`



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19129854
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,157 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
    +  extends Logging with Serializable {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
    +   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
    +   * ground truth set is less than k.
    +   *
    +   * If a query has an empty ground truth set, zero will be returned together with a log warning.
    +   *
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision, must be positive
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = {
    +    require (k > 0,"ranking position k should be positive")
    --- End diff --
    
    `require(k > 0, "ranking ...` (remove space before `(` and add space after `,`)


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18876036
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,100 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
    +
    +  /**
    +   * Returns the precsion@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    --- End diff --
    
    btw, I think it is better to use `precisionAt` instead of `precision` to emphasize that the parameter is a position. We use `precision(class)` in `MulticlassMetrics`, and users may get confused if we use the same name.


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18875998
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Returns the precsion@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    +    val labSet = lab.toSet
    +    val n = pred.length
    +    val topKPrec = Array.fill[Double](n)(0.0)
    +    var (i, cnt) = (0, 0)
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      topKPrec(i) = cnt.toDouble / (i + 1)
    +      i += 1
    +    }
    +    topKPrec
    +  }
    +
    +  /**
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precision(k: Int): Double = precAtK.map {topKPrec =>
    +    val n = topKPrec.length
    +    if (k <= n) {
    +      topKPrec(k - 1)
    +    } else {
    +      topKPrec(n - 1) * n / k
    +    }
    +  }.mean
    +
    +  /**
    +   * Returns the average precision for each query
    +   */
    +  lazy val avePrec: RDD[Double] = predictionAndLabels.map {case (pred, lab) =>
    --- End diff --
    
    Same as `precAtK`. We should make it private or put it inside `meanAveragePrecision`.
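
    A rough sketch of the folded-in version (essentially what a later revision
    of this patch ends up doing; the empty-ground-truth guard added later is
    omitted here):

    ```
    lazy val meanAveragePrecision: Double = predictionAndLabels.map { case (pred, lab) =>
      val labSet = lab.toSet
      var i = 0
      var cnt = 0
      var precSum = 0.0
      while (i < pred.length) {
        if (labSet.contains(pred(i))) {
          cnt += 1
          precSum += cnt.toDouble / (i + 1)
        }
        i += 1
      }
      // Per-query average precision; the trailing .mean averages over queries.
      precSum / labSet.size
    }.mean
    ```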


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58572854
  
    Can one of the admins verify this patch?


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18876002
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Returns the precsion@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    +    val labSet = lab.toSet
    +    val n = pred.length
    +    val topKPrec = Array.fill[Double](n)(0.0)
    +    var (i, cnt) = (0, 0)
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      topKPrec(i) = cnt.toDouble / (i + 1)
    +      i += 1
    +    }
    +    topKPrec
    +  }
    +
    +  /**
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precision(k: Int): Double = precAtK.map {topKPrec =>
    +    val n = topKPrec.length
    +    if (k <= n) {
    +      topKPrec(k - 1)
    +    } else {
    +      topKPrec(n - 1) * n / k
    +    }
    +  }.mean
    +
    +  /**
    +   * Returns the average precision for each query
    +   */
    +  lazy val avePrec: RDD[Double] = predictionAndLabels.map {case (pred, lab) =>
    +    val labSet = lab.toSet
    +    var (i, cnt, precSum) = (0, 0, 0.0)
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    +  }
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAvePrec: Double = avePrec.mean
    +
    +  /**
    +   * Returns the normalized discounted cumulative gain for each query
    +   */
    +  lazy val ndcgAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab) =>
    --- End diff --
    
    Ditto. Make it private or put it inside `ndcgAt(k)`.


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18468077
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    --- End diff --
    
    Hm, are these IDs of some kind then? The example makes them look like rankings, but maybe that's a coincidence. They shouldn't be labels, right? Because the resulting set would almost always be {0.0, 1.0}.


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18875997
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Returns the precsion@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    +    val labSet = lab.toSet
    +    val n = pred.length
    +    val topKPrec = Array.fill[Double](n)(0.0)
    +    var (i, cnt) = (0, 0)
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      topKPrec(i) = cnt.toDouble / (i + 1)
    +      i += 1
    +    }
    +    topKPrec
    +  }
    +
    +  /**
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precision(k: Int): Double = precAtK.map {topKPrec =>
    --- End diff --
    
    If `k` is given, we only need to compute `topKPrec(k-1)` for each record. We can do one of the following:
    
    1. compute on-demand, every call requires a pass.
    2. compute average `prec@k` for all possible k and cache the result. Then simply return the cached result.
    
    The first approach may be good for this PR. We can optimize it later.
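
    A sketch of the on-demand version (approach 1), doing one pass per call and
    only counting hits in the top k; this is close to what a later revision of
    the patch adopts:

    ```
    def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
      val labSet = lab.toSet
      val n = math.min(pred.length, k)
      var i = 0
      var cnt = 0
      while (i < n) {
        if (labSet.contains(pred(i))) cnt += 1
        i += 1
      }
      cnt.toDouble / k
    }.mean
    ```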


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18875991
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,100 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
    +
    +  /**
    +   * Returns the precsion@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    --- End diff --
    
    Should it be private? Think about what users would call it for. Even if we make it public, users may still need some aggregate metrics out of it. For the first version, I think it is safe to provide only the following:
    
    1. precisionAt(k: Int): Double
    2. ndcgAt(k: Int): Double
    3. meanAveragePrecision
    
    `{case (pred, lab)=>` -> `{ case (pred, lab) =>` (spaces after `{` and after `)`)
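
    A signature-level sketch of that minimal public surface (method bodies
    elided; the names follow the three items above):

    ```
    import org.apache.spark.rdd.RDD

    class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
      /** Average precision@k over all queries. */
      def precisionAt(k: Int): Double = ???
      /** Average NDCG, truncated at ranking position k, over all queries. */
      def ndcgAt(k: Int): Double = ???
      /** Mean average precision (MAP) over all queries. */
      lazy val meanAveragePrecision: Double = ???
    }
    ```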


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18447432
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
    --- End diff --
    
    Might check that arguments are not empty and of equal length?


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19129866
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,157 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
    +  extends Logging with Serializable {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
    +   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
    +   * ground truth set is less than k.
    +   *
    +   * If a query has an empty ground truth set, zero will be returned together with a log warning.
    +   *
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision, must be positive
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = {
    +    require (k > 0,"ranking position k should be positive")
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val n = math.min(pred.length, k)
    +      var i = 0
    +      var cnt = 0
    +
    +      while (i < n) {
    +        if (labSet.contains(pred(i))) {
    +          cnt += 1
    +        }
    +        i += 1
    +      }
    +      if (labSet.size == 0) {
    +        logWarning("Empty ground truth set, check input data")
    +        0.0
    +      } else {
    +        cnt.toDouble / k
    +      }
    +    }.mean
    +  }
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries.
    +   * If a query has an empty ground truth set, the average precision will be zero and a log
    +   * warining is generated.
    +   */
    +  lazy val meanAveragePrecision: Double = {
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val labSetSize = labSet.size
    +      var i = 0
    +      var cnt = 0
    +      var precSum = 0.0
    +      val n = pred.length
    +
    +      while (i < n) {
    +        if (labSet.contains(pred(i))) {
    +          cnt += 1
    +          precSum += cnt.toDouble / (i + 1)
    +        }
    +        i += 1
    +      }
    +      if (labSetSize == 0) {
    +        logWarning("Empty ground truth set, check input data")
    +        0.0
    +      } else {
    +        precSum / labSet.size
    +      }
    +    }.mean
    +  }
    +
    +  /**
    +   * Compute the average NDCG value of all the queries, truncated at ranking position k.
    +   * The discounted cumulative gain at position k is computed as:
    +   *    \sum_{i=1}^k (2^{relevance of ith item} - 1) / log(i + 1),
    +   * and the NDCG is obtained by dividing the DCG value on the ground truth set. In the current
    +   * implementation, the relevance value is binary.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at position n
    +   * will be used. If the ground truth set contains n (n < k) results, the first n items will be
    +   * used to compute the DCG value on the ground truth set.
    +   *
    +   * If a query has an empty ground truth set, zero will be returned together with a log warning.
    +   *
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated ndcg, must be positive
    +   * @return the average ndcg at the first k ranking positions
    +   */
    +  def ndcgAt(k: Int): Double = {
    +    require (k > 0,"ranking position k should be positive")
    --- End diff --
    
    `require(k > 0, "ranking ...`


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-60010395
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22009/
    Test PASSed.


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19046955
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results,
    +   * the precision value will be computed as #(relevant items retrived) / k.
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val n = math.min(pred.length, k)
    +    var i = 0
    +    var cnt = 0
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      i += 1
    +    }
    +    cnt.toDouble / k
    +  }.mean
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    var i = 0
    +    var cnt = 0
    +    var precSum = 0.0
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    +  }.mean
    +
    +  /**
    +   * Compute the average NDCG value of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at
    +   * at position n will be used. See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated ndcg
    +   * @return the average ndcg at the first k ranking positions
    +   */
    +  def ndcgAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val labSetSize = labSet.size
    +    val n = math.min(math.max(pred.length, labSetSize), k)
    +    var maxDcg = 0.0
    +    var dcg = 0.0
    +    var i = 0
    +
    +    while (i < n) {
    +      // Calculate 1/log2(i + 2)
    +      val gain = math.log(2) / math.log(i + 2)
    --- End diff --
    
    `math.log(2)` -> `1.0`
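
    (The quoted line computes 1 / log2(i + 2) as `math.log(2) / math.log(i + 2)`.
    Since the same discount term appears in both the DCG and the ideal DCG, the
    logarithm base cancels in the NDCG ratio, so the cheaper
    `val gain = 1.0 / math.log(i + 2)` gives the same result.)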


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59794008
  
    test this please


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-57979724
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21318/consoleFull) for   PR 2667 at commit [`3a5a6ff`](https://github.com/apache/spark/commit/3a5a6ffdb036f8432911184920193a4b8a007084).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) `
      * `case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)`
      * `case class UncacheTableCommand(tableName: String) extends Command`
      * `case class CacheTableCommand(`
      * `case class UncacheTableCommand(tableName: String) extends LeafNode with Command `
      * `case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(`



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18447094
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
    +
    +  /**
    +   * Returns the precsion@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    +    val labSet : Set[Double] = lab.toSet
    +    val n = pred.length
    +    val topkPrec = Array.fill[Double](n)(.0)
    +    var (i, cnt) = (0, 0)
    --- End diff --
    
    `0.0` instead of `.0`? And I am not sure it is helpful to init 2 or more variables in a line using a tuple.
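
    That is, something along these lines instead of the `.0` literal and the
    tuple initialization:

    ```
    val topkPrec = Array.fill[Double](n)(0.0)
    var i = 0
    var cnt = 0
    ```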


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19129853
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,157 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
    +  extends Logging with Serializable {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
    +   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
    +   * ground truth set is less than k.
    +   *
    +   * If a query has an empty ground truth set, zero will be returned together with a log warning.
    --- End diff --
    
    `returned` -> `used as precision`. We don't `return` zero.


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18554098
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    --- End diff --
    
    +1 on @srowen 's suggestion. We can use a generic type here with a ClassTag.
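
    That is, a declaration along these lines (a later revision of this patch
    adopts exactly this signature):

    ```
    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
    ```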


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19129863
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,157 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
    +  extends Logging with Serializable {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
    +   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
    +   * ground truth set is less than k.
    +   *
    +   * If a query has an empty ground truth set, zero will be returned together with a log warning.
    +   *
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision, must be positive
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = {
    +    require(k > 0, "ranking position k should be positive")
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val n = math.min(pred.length, k)
    +      var i = 0
    +      var cnt = 0
    +
    +      while (i < n) {
    +        if (labSet.contains(pred(i))) {
    +          cnt += 1
    +        }
    +        i += 1
    +      }
    +      if (labSet.size == 0) {
    +        logWarning("Empty ground truth set, check input data")
    +        0.0
    +      } else {
    +        cnt.toDouble / k
    +      }
    +    }.mean
    +  }
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries.
    +   * If a query has an empty ground truth set, the average precision will be zero and a log
    +   * warning is generated.
    +   */
    +  lazy val meanAveragePrecision: Double = {
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val labSetSize = labSet.size
    +      var i = 0
    +      var cnt = 0
    +      var precSum = 0.0
    +      val n = pred.length
    +
    +      while (i < n) {
    +        if (labSet.contains(pred(i))) {
    +          cnt += 1
    +          precSum += cnt.toDouble / (i + 1)
    +        }
    +        i += 1
    +      }
    +      if (labSetSize == 0) {
    +        logWarning("Empty ground truth set, check input data")
    +        0.0
    +      } else {
    +        precSum / labSet.size
    +      }
    +    }.mean
    +  }
    +
    +  /**
    +   * Compute the average NDCG value of all the queries, truncated at ranking position k.
    +   * The discounted cumulative gain at position k is computed as:
    +   *    \sum_{i=1}^k (2^{relevance of ith item} - 1) / log(i + 1),
    +   * and the NDCG is obtained by dividing by the DCG value of the ground truth set. In the current
    +   * implementation, the relevance value is binary.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at position n
    +   * will be used. If the ground truth set contains n (n < k) results, the first n items will be
    +   * used to compute the DCG value on the ground truth set.
    --- End diff --
    
    This paragraph is not necessary because those cases are compatible with the definition of NDCG.
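    For reference, a minimal usage sketch of the metrics discussed here (method names are taken from the diff above; the example data mirrors the test suite, and a local SparkContext `sc` is assumed):

    ```scala
    import org.apache.spark.mllib.evaluation.RankingMetrics

    // Each pair is (predicted ranking of document ids, ground truth set of ids).
    // Assumes a live SparkContext named sc.
    val predictionAndLabels = sc.parallelize(Seq(
      (Array(1, 6, 2, 7, 8, 3, 9, 10, 4, 5), Array(1, 2, 3, 4, 5)),
      (Array(4, 1, 5, 6, 2, 7, 3, 8, 9, 10), Array(1, 2, 3))
    ), 2)

    val metrics = new RankingMetrics(predictionAndLabels)
    metrics.precisionAt(5)        // precision truncated at rank 5, averaged over queries
    metrics.meanAveragePrecision  // MAP over all queries
    metrics.ndcgAt(10)            // NDCG truncated at rank 10, binary relevance
    ```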




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18447056
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    --- End diff --
    
    The inputs are really ranks, right? Would this not be more natural as `Int` then?
    I might have expected the inputs to be predicted and ground truth "scores" instead, in which case `Double` makes sense. But then the methods would need to convert them to rankings.
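    If a model produces per-document scores rather than an ordered list, converting to the ranking this class expects is a one-liner. A sketch (the `scored` array and its contents are hypothetical, for illustration only):

    ```scala
    // Hypothetical per-query model output: (documentId, score) pairs.
    val scored: Array[(Int, Double)] = Array((4, 0.9), (7, 0.75), (1, 0.6))

    // The evaluator expects ids ordered by decreasing predicted relevance,
    // so sort by score and drop the scores.
    val predictedRanking: Array[Int] = scored.sortBy(-_._2).map(_._1)
    // predictedRanking == Array(4, 7, 1)
    ```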




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-60001359
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22009/consoleFull) for   PR 2667 at commit [`d881097`](https://github.com/apache/spark/commit/d88109767c1ffd853b2b03c33c8d722353fd39ce).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59468584
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21842/
    Test PASSed.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58431272
  
    @srowen Making `k` explicit is what I suggested to @coderxiang. This should clearly distinguish these from the other metrics. F-measure may not be useful here, nor is recall. Let's focus on `ndcg@k`, `prec@k`, and `MAP` in this PR. @coderxiang Does that sound good to you?




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19046944
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results,
    +   * the precision value will be computed as #(relevant items retrieved) / k.
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    --- End diff --
    
    Please check `k > 0`.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58313651
  
    @mengxr @coderxiang Hm, come to think of it, doesn't this duplicate a lot of the purpose of https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala? It also computes precision and recall, and supports multi-class classification instead of the binary case only. It feels like these two should be rationalized rather than kept as two unrelated implementations.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59810509
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21921/consoleFull) for   PR 2667 at commit [`be6645e`](https://github.com/apache/spark/commit/be6645eb4a6814f0a8d9983625444630e04e723e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])`





[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by coderxiang <gi...@git.apache.org>.
Github user coderxiang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18469605
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    --- End diff --
    
    Take the first test case for example, `(Array(1, 6, 2, 7, 8, 3, 9, 10, 4, 5), Array(1 to 5))`: for this single query, the ideal ranking algorithm should return Documents 1 to 5. However, the ranking algorithm returns 10 documents, with IDs (1, 6, .....).
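    To make the arithmetic concrete, a hand-worked check of that query (plain Scala sketch, no Spark needed; the values are only illustrative):

    ```scala
    val pred = Array(1, 6, 2, 7, 8, 3, 9, 10, 4, 5)
    val labSet = Set(1, 2, 3, 4, 5)

    // precision@5: of the first five predictions (1, 6, 2, 7, 8), two are relevant.
    val precAt5 = pred.take(5).count(labSet.contains).toDouble / 5  // 0.4

    // average precision: relevant documents are hit at ranks 1, 3, 6, 9 and 10,
    // so AP = (1/1 + 2/3 + 3/6 + 4/9 + 5/10) / 5 ≈ 0.622
    ```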




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19046952
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results,
    +   * the precision value will be computed as #(relevant items retrieved) / k.
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val n = math.min(pred.length, k)
    +    var i = 0
    +    var cnt = 0
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      i += 1
    +    }
    +    cnt.toDouble / k
    +  }.mean
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    var i = 0
    +    var cnt = 0
    +    var precSum = 0.0
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    +  }.mean
    +
    +  /**
    +   * Compute the average NDCG value of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at
    +   * position n will be used. See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated ndcg
    +   * @return the average ndcg at the first k ranking positions
    +   */
    +  def ndcgAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    --- End diff --
    
    check `k > 0`




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58708705
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21600/
    Test PASSed.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18447381
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/RankingMetricsSuite.scala ---
    @@ -0,0 +1,49 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.mllib.util.LocalSparkContext
    +
    +class RankingMetricsSuite extends FunSuite with LocalSparkContext {
    +  test("Ranking metrics: map, ndcg") {
    +    val predictionAndLabels = sc.parallelize(
    +      Seq(
    +        (Array[Double](1, 6, 2, 7, 8, 3, 9, 10, 4, 5), Array[Double](1, 2, 3, 4, 5)),
    +        (Array[Double](4, 1, 5, 6, 2, 7, 3, 8, 9, 10), Array[Double](1, 2, 3))
    +      ), 2)
    +    val eps: Double = 1e-5
    +
    +    val metrics = new RankingMetrics(predictionAndLabels)
    +    val precAtK = metrics.precAtK.collect()
    +    val avePrec = metrics.avePrec.collect()
    +    val map = metrics.meanAvePrec
    +    val ndcg = metrics.ndcg.collect()
    +    val aveNdcg = metrics.meanNdcg
    +
    +    assert(math.abs(precAtK(0)(4) - 0.4) < eps)
    --- End diff --
    
    Check out the `~==` operator used in other tests
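    For example (assuming the suite imports `org.apache.spark.mllib.util.TestingUtils._`), the assertion above could read:

    ```scala
    // absolute-tolerance comparison instead of math.abs(...) < eps
    assert(precAtK(0)(4) ~== 0.4 absTol eps)
    ```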




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58056891
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21333/consoleFull) for   PR 2667 at commit [`e443fee`](https://github.com/apache/spark/commit/e443feef8321dfc26eb5630f4d3b3dc0c6b79b98).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) `





[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59973208
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21997/consoleFull) for   PR 2667 at commit [`f113ee1`](https://github.com/apache/spark/commit/f113ee11844f7f22c3e719079415d8c86828355c).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18554310
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,100 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
    +
    +  /**
    +   * Returns the precision@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map { case (pred, lab) =>
    --- End diff --
    
    Returning an RDD may not be useful in evaluation. Usually people look for scalar metrics: `prec@k` and `ndcg@k` on a test dataset usually mean the average over queries, not the individual per-query values.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59808976
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21920/
    Test PASSed.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58320921
  
    @srowen Ranking metrics are different from multiclass metrics. In general, multiclass metrics do not consider the ordering of the predictions, but just hits and misses, and they don't truncate the result. Ranking metrics consider the ordering of predictions, applying discounts and truncating the result. I think they are two sets of evaluation metrics.
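    A small example of that distinction (sketch only; the numbers are worked out by hand): two predicted rankings with exactly the same items in a different order get the same set-based precision, yet very different MAP.

    ```scala
    val groundTruth = Array(1, 2)
    val rankedWell  = Array(1, 2, 6, 7, 8)  // relevant items first
    val rankedLate  = Array(6, 7, 8, 1, 2)  // relevant items last

    // precision@5 is 2/5 = 0.4 for both rankings (order is ignored).
    // average precision: rankedWell -> (1/1 + 2/2) / 2 = 1.0
    //                    rankedLate -> (1/4 + 2/5) / 2 = 0.325
    ```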




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18875990
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +
    --- End diff --
    
    remove extra empty lines




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18447076
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
    +
    +  /**
    +   * Returns the precision@k for each query
    --- End diff --
    
    Might actually use `@return` here, but there is also no `k` in the code or docs. This is the length of (both) arguments?




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19069469
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results,
    +   * the precision value will be computed as #(relevant items retrieved) / k.
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val n = math.min(pred.length, k)
    +    var i = 0
    +    var cnt = 0
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      i += 1
    +    }
    +    cnt.toDouble / k
    +  }.mean
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    var i = 0
    +    var cnt = 0
    +    var precSum = 0.0
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    +  }.mean
    +
    +  /**
    +   * Compute the average NDCG value of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at
    +   * position n will be used. See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated ndcg
    +   * @return the average ndcg at the first k ranking positions
    +   */
    +  def ndcgAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val labSetSize = labSet.size
    +    val n = math.min(math.max(pred.length, labSetSize), k)
    +    var maxDcg = 0.0
    +    var dcg = 0.0
    +    var i = 0
    +
    +    while (i < n) {
    +      // Calculate 1/log2(i + 2)
    +      val gain = math.log(2) / math.log(i + 2)
    --- End diff --
    
    The `math.log(2)` factor is there by definition, but it is not necessary because of the normalization.
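    A quick numeric check of that point (a simplified plain-Scala sketch with binary relevance, not the patch's exact truncation logic): scaling the discount by the constant `math.log(2)` scales the DCG and the ideal DCG identically, so the NDCG ratio is unchanged.

    ```scala
    val pred   = Array(1, 6, 2, 7, 8)
    val labSet = Set(1, 2, 3)

    def ndcg(discount: Int => Double): Double = {
      // DCG over the predicted ranking, binary relevance.
      val dcg = pred.zipWithIndex
        .collect { case (id, i) if labSet.contains(id) => discount(i) }
        .sum
      // Ideal DCG: all top positions filled with relevant items.
      val idealDcg = (0 until math.min(pred.length, labSet.size)).map(discount).sum
      dcg / idealDcg
    }

    ndcg(i => math.log(2) / math.log(i + 2))  // with the constant, as in the patch
    ndcg(i => 1.0 / math.log(i + 2))          // constant dropped: same NDCG value
    ```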




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by coderxiang <gi...@git.apache.org>.
Github user coderxiang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18467899
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
    +
    +  /**
    +   * Returns the precision@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet: Set[Double] = lab.toSet
    --- End diff --
    
    The current setting is exactly the latter case you mentioned. `pred` and `lab` are two arrays of IDs that contain the predicted IDs and the ground truth IDs, respectively.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19129865
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,157 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
    +  extends Logging with Serializable {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
    +   * computed as #(relevant items retrieved) / k. This formula also applies when the size of the
    +   * ground truth set is less than k.
    +   *
    +   * If a query has an empty ground truth set, zero will be returned together with a log warning.
    +   *
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision, must be positive
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = {
    +    require(k > 0, "ranking position k should be positive")
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val n = math.min(pred.length, k)
    +      var i = 0
    +      var cnt = 0
    +
    +      while (i < n) {
    +        if (labSet.contains(pred(i))) {
    +          cnt += 1
    +        }
    +        i += 1
    +      }
    +      if (labSet.size == 0) {
    +        logWarning("Empty ground truth set, check input data")
    +        0.0
    +      } else {
    +        cnt.toDouble / k
    +      }
    +    }.mean
    +  }
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries.
    +   * If a query has an empty ground truth set, the average precision will be zero and a log
    +   * warning is generated.
    +   */
    +  lazy val meanAveragePrecision: Double = {
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val labSetSize = labSet.size
    +      var i = 0
    +      var cnt = 0
    +      var precSum = 0.0
    +      val n = pred.length
    +
    +      while (i < n) {
    +        if (labSet.contains(pred(i))) {
    +          cnt += 1
    +          precSum += cnt.toDouble / (i + 1)
    +        }
    +        i += 1
    +      }
    +      if (labSetSize == 0) {
    +        logWarning("Empty ground truth set, check input data")
    +        0.0
    +      } else {
    +        precSum / labSet.size
    +      }
    +    }.mean
    +  }
    +
    +  /**
    +   * Compute the average NDCG value of all the queries, truncated at ranking position k.
    +   * The discounted cumulative gain at position k is computed as:
    +   *    \sum_{i=1}^k (2^{relevance of ith item} - 1) / log(i + 1),
    +   * and the NDCG is obtained by dividing by the DCG value of the ground truth set. In the current
    +   * implementation, the relevance value is binary.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at position n
    +   * will be used. If the ground truth set contains n (n < k) results, the first n items will be
    +   * used to compute the DCG value on the ground truth set.
    +   *
    +   * If a query has an empty ground truth set, zero will be returned together with a log warning.
    --- End diff --
    
    ditto: `returned` -> `used as NDCG`




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18875996
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Returns the precision@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val n = pred.length
    +    val topKPrec = Array.fill[Double](n)(0.0)
    +    var (i, cnt) = (0, 0)
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      topKPrec(i) = cnt.toDouble / (i + 1)
    +      i += 1
    +    }
    +    topKPrec
    +  }
    +
    +  /**
    +   * @param k the position to compute the truncated precision
    --- End diff --
    
    Need a one-sentence summary of the method. Also need a link to the paper that describes how to handle the case where we don't have k items.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by coderxiang <gi...@git.apache.org>.
Github user coderxiang commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58701170
  
    @mengxr @srowen Sorry for the late response. In the update, I added two new members that support computing `prec@k` and `ndcg@k`, as suggested by @mengxr. Specifically, if `k` is too large, `prec@k` will be calculated as normal while `ndcg@k` will use the last computed value, as used in [this paper](http://dl.acm.org/citation.cfm?id=345545). We can further discuss whether this is the best option.
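    As a concrete illustration of that behaviour (sketch only, using the method names from the final version of the class and assuming a local SparkContext `sc`):

    ```scala
    // Only three documents are returned for this query, but k = 5 is requested.
    val metrics = new RankingMetrics(sc.parallelize(Seq(
      (Array(1, 6, 2), Array(1, 2, 3, 4, 5))
    )))

    metrics.precisionAt(5)  // still divides by k: 2 relevant hits / 5 = 0.4
    metrics.ndcgAt(5)       // per the description above, evaluated on the 3 results actually returned
    ```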




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59468579
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21842/consoleFull) for   PR 2667 at commit [`f66612d`](https://github.com/apache/spark/commit/f66612df4e9d026766071e4c36ff50863aad40bc).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) `





[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59796463
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21921/consoleFull) for   PR 2667 at commit [`be6645e`](https://github.com/apache/spark/commit/be6645eb4a6814f0a8d9983625444630e04e723e).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-60011373
  
    LGTM. Merged into master. Thanks!




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59465077
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21842/consoleFull) for   PR 2667 at commit [`f66612d`](https://github.com/apache/spark/commit/f66612df4e9d026766071e4c36ff50863aad40bc).
     * This patch merges cleanly.



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59984474
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21997/consoleFull) for   PR 2667 at commit [`f113ee1`](https://github.com/apache/spark/commit/f113ee11844f7f22c3e719079415d8c86828355c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait Combiner[M, C] `
      * `trait Aggregator[V, A] `
      * `class DefaultCombiner[M: Manifest] extends Combiner[M, Array[M]] with Serializable `
      * `trait Vertex `
      * `trait Message[K] `
      * `public class TaskContext implements Serializable `
      * `abstract class JavaSparkContextVarargsWorkaround `
      * `public class StorageLevels `
      * `class Sorter<K, Buffer> `
      * `trait SparkHadoopMapRedUtil `
      * `trait SparkHadoopMapReduceUtil `
      * `class Accumulable[R, T] (`
      * `trait AccumulableParam[R, T] extends Serializable `
      * `class Accumulator[T](@transient initialValue: T, param: AccumulatorParam[T], name: Option[String])`
      * `trait AccumulatorParam[T] extends AccumulableParam[T, T] `
      * `case class Aggregator[K, V, C] (`
      * `abstract class Dependency[T] extends Serializable `
      * `abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] `
      * `class ShuffleDependency[K, V, C](`
      * `class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) `
      * `class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)`
      * `trait FutureAction[T] extends Future[T] `
      * `class ComplexFutureAction[T] extends FutureAction[T] `
      * `class InterruptibleIterator[+T](val context: TaskContext, val delegate: Iterator[T])`
      * `trait Logging `
      * `trait Partition extends Serializable `
      * `abstract class Partitioner extends Serializable `
      * `class HashPartitioner(partitions: Int) extends Partitioner `
      * `class RangePartitioner[K : Ordering : ClassTag, V](`
      * `class SerializableWritable[T <: Writable](@transient var t: T) extends Serializable `
      * `class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging `
      * `class SparkContext(config: SparkConf) extends Logging `
      * `   * Smarter version of hadoopFile() that uses class tags to figure out the classes of keys,`
      * `   * Smarter version of hadoopFile() that uses class tags to figure out the classes of keys,`
      * `   * converters, but then we couldn't have an object for every subclass of Writable (you can't`
      * `class SparkEnv (`
      * `      logWarning("The spark.cache.class property is no longer being used! Specify storage " +`
      * `class SparkException(message: String, cause: Throwable)`
      * `class SparkHadoopWriter(@transient jobConf: JobConf)`
      * `sealed trait TaskFailedReason extends TaskEndReason `
      * `case class FetchFailed(`
      * `case class ExceptionFailure(`
      * `      "public class " + className + " implements java.io.Serializable `
      * `class JavaDoubleRDD(val srdd: RDD[scala.Double]) extends JavaRDDLike[JDouble, JavaDoubleRDD] `
      * `class JavaHadoopRDD[K, V](rdd: HadoopRDD[K, V])`
      * `class JavaNewHadoopRDD[K, V](rdd: NewHadoopRDD[K, V])`
      * `class JavaPairRDD[K, V](val rdd: RDD[(K, V)])`
      * `class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])`
      * `trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable `
      * `class JavaSparkContext(val sc: SparkContext)`
      * `trait Converter[T, + U] extends Serializable `
      * `  class WriterThread(env: SparkEnv, worker: Socket, split: Partition, context: TaskContext)`
      * `  class MonitorThread(env: SparkEnv, worker: Socket, context: TaskContext)`
      * `class BytesToString extends org.apache.spark.api.java.function.Function[Array[Byte], String] `
      * `  class ArrayConstructor extends net.razorvine.pickle.objects.ArrayConstructor `
      * `case class TestWritable(var str: String, var int: Int, var double: Double) extends Writable `
      * `abstract class Broadcast[T: ClassTag](val id: Long) extends Serializable `
      * `trait BroadcastFactory `
      * `class HttpBroadcastFactory extends BroadcastFactory `
      * `class TorrentBroadcastFactory extends BroadcastFactory `
      * `  case class RegisterWorker(`
      * `  case class ExecutorStateChanged(`
      * `  case class DriverStateChanged(`
      * `  case class WorkerSchedulerStateResponse(id: String, executors: List[ExecutorDescription],`
      * `  case class Heartbeat(workerId: String) extends DeployMessage`
      * `  case class RegisteredWorker(masterUrl: String, masterWebUiUrl: String) extends DeployMessage`
      * `  case class RegisterWorkerFailed(message: String) extends DeployMessage`
      * `  case class KillExecutor(masterUrl: String, appId: String, execId: Int) extends DeployMessage`
      * `  case class LaunchExecutor(`
      * `  case class LaunchDriver(driverId: String, driverDesc: DriverDescription) extends DeployMessage`
      * `  case class KillDriver(driverId: String) extends DeployMessage`
      * `  case class RegisterApplication(appDescription: ApplicationDescription)`
      * `  case class MasterChangeAcknowledged(appId: String)`
      * `  case class RegisteredApplication(appId: String, masterUrl: String) extends DeployMessage`
      * `  case class ExecutorAdded(id: Int, workerId: String, hostPort: String, cores: Int, memory: Int) `
      * `  case class ExecutorUpdated(id: Int, state: ExecutorState, message: Option[String],`
      * `  case class ApplicationRemoved(message: String)`
      * `  case class RequestSubmitDriver(driverDescription: DriverDescription) extends DeployMessage`
      * `  case class SubmitDriverResponse(success: Boolean, driverId: Option[String], message: String)`
      * `  case class RequestKillDriver(driverId: String) extends DeployMessage`
      * `  case class KillDriverResponse(driverId: String, success: Boolean, message: String)`
      * `  case class RequestDriverStatus(driverId: String) extends DeployMessage`
      * `  case class DriverStatusResponse(found: Boolean, state: Option[DriverState],`
      * `  case class MasterChanged(masterUrl: String, masterWebUiUrl: String)`
      * `  case class MasterStateResponse(host: String, port: Int, workers: Array[WorkerInfo],`
      * `  case class WorkerStateResponse(host: String, port: Int, workerId: String,`
      * `class LocalSparkCluster(numWorkers: Int, coresPerWorker: Int, memoryPerWorker: Int)`
      * `    // Build up a PYTHONPATH that includes the Spark assembly JAR (where this class is), the`
      * `class SparkHadoopUtil extends Logging `
      * `   *           (4) the main class for the child`
      * `          println(s"Failed to load main class $childMainClass.")`
      * `      throw new IllegalStateException("The main method in the given main class must be static")`
      * `          SparkSubmit.printErrorAndExit("Cannot load main class from JAR: " + primaryResource)`
      * `      SparkSubmit.printErrorAndExit("No main class set in JAR; please specify one with --class")`
      * `        |  --class CLASS_NAME          Your application's main class (for Java / Scala apps).`
      * `  class ClientActor extends Actor with ActorLogReceive with Logging `
      * `  class TestListener extends AppClientListener with Logging `
      * `class HistoryServer(`
      * `      |  spark.history.provider             Name of history provider class (defaults to`
      * `class ApplicationSource(val application: ApplicationInfo) extends Source `
      * `  case class BeginRecovery(storedApps: Seq[ApplicationInfo], storedWorkers: Seq[WorkerInfo])`
      * `  case class WebUIPortResponse(webUIBoundPort: Int)`
      * `class ZooKeeperPersistenceEngine(serialization: Serialization, conf: SparkConf)`
      * `class MasterWebUI(val master: Master, requestedPort: Int)`
      * `class WorkerWebUI(`
      * `  class TaskRunner(`
      * `      logInfo("Using REPL class URI: " + classUri)`
      * `          logInfo("Adding " + url + " to class loader")`
      * `class TaskMetrics extends Serializable `
      * `case class InputMetrics(readMethod: DataReadMethod.Value) `
      * `class ShuffleReadMetrics extends Serializable `
      * `class ShuffleWriteMetrics extends Serializable `
      * `trait CompressionCodec `
      * `class LZ4CompressionCodec(conf: SparkConf) extends CompressionCodec `
      * `class LZFCompressionCodec(conf: SparkConf) extends CompressionCodec `
      * `class SnappyCompressionCodec(conf: SparkConf) extends CompressionCodec `
      * `        case e: Exception => logError("Source class " + classPath + " cannot be instantiated", e)`
      * `          case e: Exception => logError("Sink class " + classPath + " cannot be instantialized", e)`
      * `trait BlockDataManager `
      * `trait BlockFetchingListener extends EventListener `
      * `abstract class BlockTransferService `
      * `sealed abstract class ManagedBuffer `
      * `final class FileSegmentManagedBuffer(val file: File, val offset: Long, val length: Long)`
      * `final class NioByteBufferManagedBuffer(buf: ByteBuffer) extends ManagedBuffer `
      * `final class NettyByteBufManagedBuffer(buf: ByteBuf) extends ManagedBuffer `
      * `class NettyConfig(conf: SparkConf) `
      * `trait PathResolver `
      * `trait BlockClientListener extends EventListener `
      * `class BlockFetchingClient(factory: BlockFetchingClientFactory, hostname: String, port: Int)`
      * `class BlockFetchingClientFactory(val conf: NettyConfig) `
      * `class BlockFetchingClientHandler extends SimpleChannelInboundHandler[ByteBuf] with Logging `
      * `class LazyInitIterator(createIterator: => Iterator[Any]) extends Iterator[Any] `
      * `class ReferenceCountedBuffer(val underlying: ByteBuf) extends AnyVal `
      * `class BlockHeader(val blockSize: Int, val blockId: String, val error: Option[String] = None)`
      * `class BlockHeaderEncoder extends MessageToByteEncoder[BlockHeader] `
      * `class BlockServer(conf: NettyConfig, dataProvider: BlockDataProvider) extends Logging `
      * `class BlockServerChannelInitializer(dataProvider: BlockDataProvider)`
      * `class BlockServerHandler(dataProvider: BlockDataProvider)`
      * `class BlockMessageArray(var blockMessages: Seq[BlockMessage])`
      * `class BufferMessage(id_ : Int, val buffers: ArrayBuffer[ByteBuffer], var ackId: Int)`
      * `abstract class Connection(val channel: SocketChannel, val selector: Selector,`
      * `class SendingConnection(val address: InetSocketAddress, selector_ : Selector,`
      * `  class Inbox() `
      * `  class MessageStatus(`
      * `class MessageChunk(val header: MessageChunkHeader, val buffer: ByteBuffer) `
      * `final class NioBlockTransferService(conf: SparkConf, securityManager: SecurityManager)`
      * `class BoundedDouble(val mean: Double, val confidence: Double, val low: Double, val high: Double) `
      * `class PartialResult[R](initialVal: R, isFinal: Boolean) `
      * `class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Logging `
      * `class BlockRDD[T: ClassTag](@transient sc: SparkContext, @transient val blockIds: Array[BlockId])`
      * `class CartesianPartition(`
      * `class CartesianRDD[T: ClassTag, U: ClassTag](`
      * `class CheckpointRDD[T: ClassTag](sc: SparkContext, val checkpointPath: String)`
      * `class CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: Partitioner)`
      * `  val rnd = new scala.util.Random(7919) // keep this class deterministic`
      * `  class LocationIterator(prev: RDD[_]) extends Iterator[(String, Partition)] `
      * `class DoubleRDDFunctions(self: RDD[Double]) extends Logging with Serializable `
      * `class FlatMappedRDD[U: ClassTag, T: ClassTag](`
      * `class FlatMappedValuesRDD[K, V, U](prev: RDD[_ <: Product2[K, V]], f: V => TraversableOnce[U])`
      * `class HadoopRDD[K, V](`
      * `class JdbcRDD[T: ClassTag](`
      * `class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)`
      * `class MappedValuesRDD[K, V, U](prev: RDD[_ <: Product2[K, V]], f: V => U)`
      * `class NewHadoopRDD[K, V](`
      * `class PairRDDFunctions[K, V](self: RDD[(K, V)])`
      * `      throw new SparkException("Output format class not set")`
      * `      throw new SparkException("Output key class not set")`
      * `      throw new SparkException("Output value class not set")`
      * `class PartitionPruningRDD[T: ClassTag](`
      * `class PartitionerAwareUnionRDDPartition(`
      * `class PartitionerAwareUnionRDD[T: ClassTag](`
      * `class PartitionwiseSampledRDDPartition(val prev: Partition, val seed: Long)`
      * `  class NotEqualsFileNameFilter(filterName: String) extends FilenameFilter `
      * `abstract class RDD[T: ClassTag](`
      * `class SampledRDDPartition(val prev: Partition, val seed: Int) extends Partition with Serializable `
      * `class SequenceFileRDDFunctions[K <% Writable: ClassTag, V <% Writable : ClassTag](`
      * `            m => m.getReturnType().toString != "class java.lang.Object" &&`
      * `class ShuffledRDD[K, V, C](`
      * `class UnionRDD[T: ClassTag](`
      * `class ZippedWithIndexRDDPartition(val prev: Partition, val startIndex: Long)`
      * `class ZippedWithIndexRDD[T: ClassTag](@transient prev: RDD[T]) extends RDD[(T, Long)](prev) `
      * `class AccumulableInfo (`
      * `class DAGScheduler(`
      * `case class BeginEvent(task: Task[_], taskInfo: TaskInfo) extends DAGSchedulerEvent`
      * `case class GettingResultEvent(taskInfo: TaskInfo) extends DAGSchedulerEvent`
      * `case class TaskSetFailed(taskSet: TaskSet, reason: String) extends DAGSchedulerEvent`
      * `class ExecutorLossReason(val message: String) `
      * `case class ExecutorExited(val exitCode: Int)`
      * `case class SlaveLost(_message: String = "Slave lost")`
      * `class InputFormatInfo(val configuration: Configuration, val inputFormatClazz: Class[_],`
      * `class JobLogger(val user: String, val logDirName: String) extends SparkListener with Logging `
      * `case class SparkListenerStageSubmitted(stageInfo: StageInfo, properties: Properties = null)`
      * `case class SparkListenerStageCompleted(stageInfo: StageInfo) extends SparkListenerEvent`
      * `case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)`
      * `case class SparkListenerTaskGettingResult(taskInfo: TaskInfo) extends SparkListenerEvent`
      * `case class SparkListenerTaskEnd(`
      * `case class SparkListenerJobStart(jobId: Int, stageIds: Seq[Int], properties: Properties = null)`
      * `case class SparkListenerJobEnd(jobId: Int, jobResult: JobResult) extends SparkListenerEvent`
      * `case class SparkListenerEnvironmentUpdate(environmentDetails: Map[String, Seq[(String, String)]])`
      * `case class SparkListenerBlockManagerAdded(time: Long, blockManagerId: BlockManagerId, maxMem: Long)`
      * `case class SparkListenerBlockManagerRemoved(time: Long, blockManagerId: BlockManagerId)`
      * `case class SparkListenerUnpersistRDD(rddId: Int) extends SparkListenerEvent`
      * `case class SparkListenerExecutorMetricsUpdate(`
      * `case class SparkListenerApplicationStart(appName: String, appId: Option[String], time: Long,`
      * `case class SparkListenerApplicationEnd(time: Long) extends SparkListenerEvent`
      * `trait SparkListener `
      * `class StatsReportListener extends SparkListener with Logging `
      * `class SplitInfo(`
      * `class StageInfo(`
      * `class TaskInfo(`
      * `case class IndirectTaskResult[T](blockId: BlockId) extends TaskResult[T] with Serializable`
      * `class DirectTaskResult[T](var valueBytes: ByteBuffer, var accumUpdates: Map[Long, Any],`
      * `case class WorkerOffer(executorId: String, host: String, cores: Int)`
      * `  case class LaunchTask(data: SerializableBuffer) extends CoarseGrainedClusterMessage`
      * `  case class KillTask(taskId: Long, executor: String, interruptThread: Boolean)`
      * `  case class RegisterExecutorFailed(message: String) extends CoarseGrainedClusterMessage`
      * `  case class RegisterExecutor(executorId: String, hostPort: String, cores: Int)`
      * `  case class StatusUpdate(executorId: String, taskId: Long, state: TaskState,`
      * `  case class RemoveExecutor(executorId: String, reason: String) extends CoarseGrainedClusterMessage`
      * `  case class AddWebUIFilter(filterName:String, filterParams: Map[String, String], proxyBase :String)`
      * `class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, actorSystem: ActorSystem)`
      * `  class DriverActor(sparkProperties: Seq[(String, String)]) extends Actor with ActorLogReceive `
      * `class JavaSerializer(conf: SparkConf) extends Serializer with Externalizable `
      * `class KryoSerializer(conf: SparkConf)`
      * `class KryoSerializationStream(kryo: Kryo, outStream: OutputStream) extends SerializationStream `
      * `class KryoDeserializationStream(kryo: Kryo, inStream: InputStream) extends DeserializationStream `
      * `trait KryoRegistrator `
      * `  // The class returned by asJavaIterable (scala.collection.convert.Wrappers$IterableWrapper).`
      * `abstract class Serializer `
      * `abstract class SerializerInstance `
      * `abstract class SerializationStream `
      * `abstract class DeserializationStream `
      * `class FileShuffleBlockManager(conf: SparkConf)`
      * `class IndexShuffleBlockManager extends ShuffleBlockManager `
      * `trait ShuffleBlockManager `
      * `case class BlockException(blockId: BlockId, message: String) extends Exception(message)`
      * `sealed abstract class BlockId `
      * `case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId `
      * `case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId `
      * `case class ShuffleDataBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId `
      * `case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId `
      * `case class BroadcastBlockId(broadcastId: Long, field: String = "") extends BlockId `
      * `case class TaskResultBlockId(taskId: Long) extends BlockId `
      * `case class StreamBlockId(streamId: Int, uniqueId: Long) extends BlockId `
      * `class BlockManagerMaster(`
      * `class BlockManagerMasterActor(val isLocal: Boolean, conf: SparkConf, listenerBus: LiveListenerBus)`
      * `case class BlockStatus(`
      * `  case class RemoveBlock(blockId: BlockId) extends ToBlockManagerSlave`
      * `  case class RemoveRdd(rddId: Int) extends ToBlockManagerSlave`
      * `  case class RemoveShuffle(shuffleId: Int) extends ToBlockManagerSlave`
      * `  case class RemoveBroadcast(broadcastId: Long, removeFromDriver: Boolean = true)`
      * `  case class RegisterBlockManager(`
      * `  case class UpdateBlockInfo(`
      * `  case class GetLocations(blockId: BlockId) extends ToBlockManagerMaster`
      * `  case class GetLocationsMultipleBlockIds(blockIds: Array[BlockId]) extends ToBlockManagerMaster`
      * `  case class GetPeers(blockManagerId: BlockManagerId) extends ToBlockManagerMaster`
      * `  case class RemoveExecutor(execId: String) extends ToBlockManagerMaster`
      * `  case class GetBlockStatus(blockId: BlockId, askSlaves: Boolean = true)`
      * `  case class GetMatchingBlockIds(filter: BlockId => Boolean, askSlaves: Boolean = true)`
      * `  case class BlockManagerHeartbeat(blockManagerId: BlockManagerId) extends ToBlockManagerMaster`
      * `class BlockManagerSlaveActor(`
      * `class BlockNotFoundException(blockId: String) extends Exception(s"Block $blockId not found")`
      * `class RDDInfo(`
      * `final class ShuffleBlockFetcherIterator(`
      * `  class FetchRequest(val address: BlockManagerId, val blocks: Seq[(BlockId, Long)]) `
      * `  class FetchResult(val blockId: BlockId, val size: Long, val deserialize: () => Iterator[Any]) `
      * `class StorageStatusListener extends SparkListener `
      * `class StorageStatus(val blockManagerId: BlockManagerId, val maxMem: Long) `
      * `  class ServletParams[T <% AnyRef](val responder: Responder[T],`
      * `class EnvironmentListener extends SparkListener `
      * `class ExecutorsListener(storageStatusListener: StorageStatusListener) extends SparkListener `
      * `class JobProgressListener(conf: SparkConf) extends SparkListener with Logging `
      * `  class ExecutorSummary `
      * `  class StageUIData `
      * `  case class TaskUIData(`
      * `class StorageListener(storageStatusListener: StorageStatusListener) extends SparkListener `
      * `class ReturnStatementFinder extends ClassVisitor(ASM4) `
      * `class FieldAccessFinder(output: Map[Class[_], Set[String]]) extends ClassVisitor(ASM4) `
      * `abstract class CompletionIterator[ +A, +I <: Iterator[A]](sub: I) extends Iterator[A] `
      * `   * (dirPermissions) used when class was instantiated.`
      * `case class MutablePair[@specialized(Int, Long, Double, Char, Boolean/* , AnyRef */) T1,`
      * `class SerializableBuffer(@transient var buffer: ByteBuffer) extends Serializable `
      * `      "Non-primitive class " + cls + " passed to primitiveSize()")`
      * `class StatCounter(values: TraversableOnce[Double]) extends Serializable `
      * `trait TaskCompletionListener extends EventListener `
      * `class TaskCompletionListenerException(errorMessages: Seq[String]) extends Exception `
      * `class Vector(val elements: Array[Double]) extends Serializable `
      * `  class Multiplier(num: Double) `
      * `class AppendOnlyMap[K, V](initialCapacity: Int = 64)`
      * `class BitSet(numBits: Int) extends Serializable `
      * `class ExternalAppendOnlyMap[K, V, C](`
      * `class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](`
      * `  // specialization to work (specialized class extends the non-specialized one and needs access`
      * `class OpenHashSet[@specialized(Long, Int) T: ClassTag](`
      * `  // specialization to work (specialized class extends the non-specialized one and needs access`
      * `  sealed class Hasher[@specialized(Long, Int) T] extends Serializable `
      * `  class LongHasher extends Hasher[Long] `
      * `  class IntHasher extends Hasher[Int] `
      * `class PrimitiveKeyOpenHashMap[@specialized(Long, Int) K: ClassTag,`
      * `  // specialization to work (specialized class extends the unspecialized one and needs access`
      * `class PrimitiveVector[@specialized(Long, Int, Double) V: ClassTag](initialSize: Int = 64) `
      * `  case class Sample(size: Long, numUpdates: Long)`
      * `class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] `
      * `class ByteArrayChunkOutputStream(chunkSize: Int) extends OutputStream `
      * `trait Pseudorandom `
      * `trait RandomSampler[T, U] extends Pseudorandom with Cloneable with Serializable `
      * `class BernoulliSampler[T](lb: Double, ub: Double, complement: Boolean = false)`
      * `class PoissonSampler[T](mean: Double) extends RandomSampler[T, T] `
      * `public class SimpleApp `
      * `case class Person(name: String, age: Int)`
      * `case class Person(name: String, age: Int)`
      * `class UsageError(Exception):`
      * `class SparkSink extends AbstractSink with Logging with Configurable `
      * `class FlumeInputDStream[T: ClassTag](`
      * `class SparkFlumeEvent() extends Externalizable `
      * `class FlumeEventServer(receiver : FlumeReceiver) extends AvroSourceProtocol `
      * `class FlumeReceiver(`
      * `  class CompressionChannelPipelineFactory extends ChannelPipelineFactory `
      * `class MQTTInputDStream(`
      * `class MQTTReceiver(`
      * `class TwitterInputDStream(`
      * `class TwitterReceiver(`
      * `public final class JavaKinesisWordCountASL `
      * `class GangliaSink(val property: Properties, val registry: MetricRegistry,`
      * `class NonASCIICharacterChecker extends ScalariformChecker `
      * `class SparkSpaceAfterCommentStartChecker extends ScalariformChecker `
      * `class ThreadSafeFinalizer(object):`
      * `class Finalizer(object):`
      * `class JavaIterator(JavaObject):`
      * `class JavaMap(JavaObject, MutableMapping):`
      * `class JavaSet(JavaObject, MutableSet):`
      * `class JavaArray(JavaObject, Sequence):`
      * `class JavaList(JavaObject, MutableSequence):`
      * `class SetConverter(object):`
      * `class ListConverter(object):`
      * `class MapConverter(object):`
      * `class NullHandler(logging.Handler):`
      * `    :import_str: The class (e.g., java.util.List) or the package`
      * `        raise Py4JError("java_class must be a string, a JavaClass, or a JavaObject")`
      * `class DummyRLock(object):`
      * `class GatewayClient(object):`
      * `class GatewayConnection(object):`
      * `class JavaMember(object):`
      * `class JavaObject(object):`
      * `class JavaClass():`
      * `class JavaPackage():`
      * `class JVMView(object):`
      * `class GatewayProperty(object):`
      * `class JavaGateway(object):`
      * `        :rtype: A JVMView instance (same class as the gateway.jvm instance).`
      * `class CallbackServer(object):`
      * `class CallbackConnection(Thread):`
      * `class PythonProxyPool(object):`
      * `of small strings are exchanged (method name, class name, variable`
      * `class Py4JError(Exception):`
      * `class Py4JNetworkError(Py4JError):`
      * `class Py4JJavaError(Py4JError):`
      * `>>> class VectorAccumulatorParam(AccumulatorParam):`
      * `class Accumulator(object):`
      * `class AccumulatorParam(object):`
      * `class AddingAccumulatorParam(AccumulatorParam):`
      * `class PStatsParam(AccumulatorParam):`
      * `class _UpdateRequestHandler(SocketServer.StreamRequestHandler):`
      * `class AccumulatorServer(SocketServer.TCPServer):`
      * `class Broadcast(object):`
      * `class CloudPickler(pickle.Pickler):`
      * `        class Dummy(object):`
      * `class SparkConf(object):`
      * `class SparkContext(object):`
      * `class SparkFiles(object):`
      * `        class EchoOutputThread(Thread):`
      * `class LogisticRegressionModel(LinearModel):`
      * `class LogisticRegressionWithSGD(object):`
      * `class SVMModel(LinearModel):`
      * `class SVMWithSGD(object):`
      * `class NaiveBayesModel(object):`
      * `    - pi: vector of logs of class priors (dimension C)`
      * `    - theta: matrix of logs of class conditional probabilities (CxD)`
      * `class NaiveBayes(object):`
      * `class KMeansModel(object):`
      * `class KMeans(object):`
      * `class Vector(object):`
      * `class DenseVector(Vector):`
      * `class SparseVector(Vector):`
      * `class Vectors(object):`
      * `class Matrix(object):`
      * `class DenseMatrix(Matrix):`
      * `class RandomRDDs(object):`
      * `class Rating(object):`
      * `class MatrixFactorizationModel(object):`
      * `class ALS(object):`
      * `class LabeledPoint(object):`
      * `class LinearModel(object):`
      * `class LinearRegressionModelBase(LinearModel):`
      * `class LinearRegressionModel(LinearRegressionModelBase):`
      * `class LinearRegressionWithSGD(object):`
      * `class LassoModel(LinearRegressionModelBase):`
      * `class LassoWithSGD(object):`
      * `class RidgeRegressionModel(LinearRegressionModelBase):`
      * `class RidgeRegressionWithSGD(object):`
      * `class MultivariateStatisticalSummary(object):`
      * `class Statistics(object):`
      * `class DecisionTreeModel(object):`
      * `class DecisionTree(object):`
      * `class MLUtils(object):`
      * `class BoundedFloat(float):`
      * `class RDD(object):`
      * `class PipelinedRDD(RDD):`
      * `class RDDSamplerBase(object):`
      * `class RDDSampler(RDDSamplerBase):`
      * `class RDDStratifiedSampler(RDDSamplerBase):`
      * `class ResultIterable(collections.Iterable):`
      * `class SpecialLengths(object):`
      * `class Serializer(object):`
      * `class FramedSerializer(Serializer):`
      * `class BatchedSerializer(Serializer):`
      * `class AutoBatchedSerializer(BatchedSerializer):`
      * `class CartesianDeserializer(FramedSerializer):`
      * `class PairDeserializer(CartesianDeserializer):`
      * `class NoOpSerializer(FramedSerializer):`
      * `class PickleSerializer(FramedSerializer):`
      * `class CloudPickleSerializer(PickleSerializer):`
      * `class MarshalSerializer(FramedSerializer):`
      * `class AutoSerializer(FramedSerializer):`
      * `class CompressedSerializer(FramedSerializer):`
      * `class UTF8Deserializer(Serializer):`
      * `class Aggregator(object):`
      * `class SimpleAggregator(Aggregator):`
      * `class Merger(object):`
      * `class InMemoryMerger(Merger):`
      * `class ExternalMerger(Merger):`
      * `class ExternalSorter(object):`
      * `class DataType(object):`
      * `class PrimitiveTypeSingleton(type):`
      * `class PrimitiveType(DataType):`
      * `class StringType(PrimitiveType):`
      * `class BinaryType(PrimitiveType):`
      * `class BooleanType(PrimitiveType):`
      * `class TimestampType(PrimitiveType):`
      * `class DecimalType(PrimitiveType):`
      * `class DoubleType(PrimitiveType):`
      * `class FloatType(PrimitiveType):`
      * `class ByteType(PrimitiveType):`
      * `class IntegerType(PrimitiveType):`
      * `class LongType(PrimitiveType):`
      * `class ShortType(PrimitiveType):`
      * `class ArrayType(DataType):`
      * `class MapType(DataType):`
      * `class StructField(DataType):`
      * `class StructType(DataType):`
      * `    class Row(tuple):`
      * `class SQLContext(object):`
      * `class HiveContext(SQLContext):`
      * `class LocalHiveContext(HiveContext):`
      * `class TestHiveContext(HiveContext):`
      * `class Row(tuple):`
      * `class SchemaRDD(RDD):`
      * `        L`
      * `class StatCounter(object):`
      * `class StorageLevel(object):`
      * `class SCCallSiteSync(object):`
      * `class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,`
      * `class ConstructorCleaner(className: String, cv: ClassVisitor)`
      * `class SparkCommandLine(args: List[String], override val settings: Settings)`
      * `trait SparkExprTyper extends Logging `
      * `class SparkILoop(in0: Option[BufferedReader], protected val out: JPrintWriter,`
      * `  class IMainOps[T <: SparkIMain](val intp: T) `
      * `  class SparkILoopInterpreter extends SparkIMain(settings, out) `
      * `    cmd("javap", "<path|class>", "disassemble a file or class name", javapCommand),`
      * `trait SparkILoopInit `
      * `  class SparkIMain(initialSettings: Settings, val out: JPrintWriter)`
      * `    val classServer                                   = new HttpServer(outputDir, new SecurityManager(conf), classServerPort, "HTTP class server")`
      * `    implicit class ReplTypeOps(tp: Type) `
      * `  class ReadEvalPrint(val lineId: Int) `
      * `    class EvalException(msg: String, cause: Throwable) extends RuntimeException(msg, cause) `
      * `  class Request(val line: String, val trees: List[Tree]) `
      * `        |class %s extends Serializable `
      * `  trait CodeAssembler[T] `
      * `  trait StrippingWriter `
      * `  trait TruncatingWriter `
      * `  abstract class StrippingTruncatingWriter(out: JPrintWriter)`
      * `  class ReplStrippingWriter(intp: SparkIMain) extends StrippingTruncatingWriter(intp.out) `
      * `  class ReplReporter(intp: SparkIMain) extends ConsoleReporter(intp.settings, null, new ReplStrippingWriter(intp)) `
      * `class SparkISettings(intp: SparkIMain) extends Logging `
      * `trait SparkImports `
      * `  case class SparkComputedImports(prepend: String, append: String, access: String)`
      * `    case class ReqAndHandler(req: Request, handler: MemberHandler) `
      * `      code append "class %sC extends Serializable `
      * `class SparkJLineCompletion(val intp: SparkIMain) extends Completion with CompletionOutput with Logging `
      * `  trait CompilerCompletion `
      * `  class TypeMemberCompletion(val tp: Type) extends CompletionAware`
      * `  class PackageCompletion(tp: Type) extends TypeMemberCompletion(tp) `
      * `  class LiteralCompletion(lit: Literal) extends TypeMemberCompletion(lit.value.tpe) `
      * `  class ImportCompletion(tp: Type) extends TypeMemberCompletion(tp) `
      * `  class JLineTabCompletion extends ScalaCompleter `
      * `class SparkJLineReader(_completion: => Completion) extends InteractiveReader `
      * `  class JLineConsoleReader extends ConsoleReader with ConsoleReaderHelper `
      * `class SparkJLineHistory extends JLineFileHistory `
      * `trait SparkMemberHandlers `
      * `  sealed abstract class MemberDefHandler(override val member: MemberDef) extends MemberHandler(member) `
      * `  sealed abstract class MemberHandler(val member: Tree) `
      * `  class GenericHandler(member: Tree) extends MemberHandler(member)`
      * `  class ValHandler(member: ValDef) extends MemberDefHandler(member) `
      * `  class DefHandler(member: DefDef) extends MemberDefHandler(member) `
      * `  class AssignHandler(member: Assign) extends MemberHandler(member) `
      * `  class ModuleHandler(module: ModuleDef) extends MemberDefHandler(module) `
      * `  class ClassHandler(member: ClassDef) extends MemberDefHandler(member) `
      * `  class TypeAliasHandler(member: TypeDef) extends MemberDefHandler(member) `
      * `  class ImportHandler(imp: Import) extends MemberHandler(imp) `
      * `class SparkRunnerSettings(error: String => Unit) extends Settings(error)`
      * `  case class Schema(dataType: DataType, nullable: Boolean)`
      * `  implicit class CaseClassRelation[A <: Product : TypeTag](data: Seq[A]) `
      * `class SqlParser extends StandardTokenParsers with PackratParsers `
      * `  protected case class Keyword(str: String)`
      * `class SqlLexical(val keywords: Seq[String]) extends StdLexical `
      * `  case class FloatLit(chars: String) extends Token `
      * `class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Boolean)`
      * `trait Catalog `
      * `class SimpleCatalog(val caseSensitive: Boolean) extends Catalog `
      * `trait OverrideCatalog extends Catalog `
      * `trait FunctionRegistry `
      * `trait OverrideFunctionRegistry extends FunctionRegistry `
      * `class SimpleFunctionRegistry extends FunctionRegistry `
      * `trait HiveTypeCoercion `
      * `trait MultiInstanceRelation `
      * `class UnresolvedException[TreeType <: TreeNode[_]](tree: TreeType, function: String) extends`
      * `case class UnresolvedRelation(`
      * `case class UnresolvedAttribute(name: String) extends Attribute with trees.LeafNode[Expression] `
      * `case class UnresolvedFunction(name: String, children: Seq[Expression]) extends Expression `
      * `case class Star(`
      * `  trait ImplicitOperators `
      * `  trait ExpressionConversions `
      * `    implicit class DslExpression(e: Expression) extends ImplicitOperators `
      * `    implicit class DslSymbol(sym: Symbol) extends ImplicitAttribute `
      * `    implicit class DslString(val s: String) extends ImplicitOperators `
      * `    abstract class ImplicitAttribute extends ImplicitOperators `
      * `    implicit class DslAttribute(a: AttributeReference) `
      * `  abstract class LogicalPlanFunctions `
      * `    implicit class DslLogicalPlan(val logicalPlan: LogicalPlan) extends LogicalPlanFunctions `
      * `  class TreeNodeException[TreeType <: TreeNode[_]](`
      * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`
      * `protected class AttributeEquals(val a: Attribute) `
      * `case class BoundReference(ordinal: Int, dataType: DataType, nullable: Boolean)`
      * `case class Cast(child: Expression, dataType: DataType) extends UnaryExpression `
      * `abstract class Expression extends TreeNode[Expression] `
      * `abstract class BinaryExpression extends Expression with trees.BinaryNode[Expression] `
      * `abstract class LeafExpression extends Expression with trees.LeafNode[Expression] `
      * `abstract class UnaryExpression extends Expression with trees.UnaryNode[Expression] `
      * `class InterpretedProjection(expressions: Seq[Expression]) extends Projection `
      * `case class InterpretedMutableProjection(expressions: Seq[Expression]) extends MutableProjection `
      * `class JoinedRow extends Row `
      * `class JoinedRow2 extends Row `
      * `class JoinedRow3 extends Row `
      * `class JoinedRow4 extends Row `
      * `class JoinedRow5 extends Row `
      * `trait Row extends Seq[Any] with Serializable `
      * `trait MutableRow extends Row `
      * `class GenericRow(protected[sql] val values: Array[Any]) extends Row `
      * `class GenericMutableRow(size: Int) extends GenericRow(size) with MutableRow `
      * `class RowOrdering(ordering: Seq[SortOrder]) extends Ordering[Row] `
      * `case class ScalaUdf(function: AnyRef, dataType: DataType, children: Seq[Expression])`
      * `case class SortOrder(child: Expression, direction: SortDirection) extends Expression `
      * `abstract class MutableValue extends Serializable `
      * `final class MutableInt extends MutableValue `
      * `final class MutableFloat extends MutableValue `
      * `final class MutableBoolean extends MutableValue `
      * `final class MutableDouble extends MutableValue `
      * `final class MutableShort extends MutableValue `
      * `final class MutableLong extends MutableValue `
      * `final class MutableByte extends MutableValue `
      * `final class MutableAny extends MutableValue `
      * `final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow `
      * `case class WrapDynamic(children: Seq[Attribute]) extends Expression `
      * `class DynamicRow(val schema: Seq[Attribute], values: Array[Any])`
      * `abstract class AggregateExpression extends Expression `
      * `case class SplitEvaluation(`
      * `abstract class PartialAggregate extends AggregateExpression `
      * `case class Min(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression] `
      * `case class MinFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction `
      * `case class Max(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression] `
      * `case class MaxFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction `
      * `case class Count(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression] `
      * `case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate `
      * `case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression `
      * `case class CollectHashSetFunction(`
      * `case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression `
      * `case class CombineSetsAndCountFunction(`
      * `case class ApproxCountDistinctPartition(child: Expression, relativeSD: Double)`
      * `case class ApproxCountDistinctMerge(child: Expression, relativeSD: Double)`
      * `case class ApproxCountDistinct(child: Expression, relativeSD: Double = 0.05)`
      * `case class Average(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression] `
      * `case class Sum(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression] `
      * `case class SumDistinct(child: Expression)`
      * `case class First(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression] `
      * `case class Last(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression] `
      * `case class AverageFunction(expr: Expression, base: AggregateExpression)`
      * `case class CountFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction `
      * `case class ApproxCountDistinctPartitionFunction(`
      * `case class ApproxCountDistinctMergeFunction(`
      * `case class SumFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction `
      * `case class SumDistinctFunction(expr: Expression, base: AggregateExpression)`
      * `case class CountDistinctFunction(`
      * `case class FirstFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction `
      * `case class LastFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction `
      * `case class UnaryMinus(child: Expression) extends UnaryExpression `
      * `case class Sqrt(child: Expression) extends UnaryExpression `
      * `abstract class BinaryArithmetic extends BinaryExpression `
      * `case class Add(left: Expression, right: Expression) extends BinaryArithmetic `
      * `case class Subtract(left: Expression, right: Expression) extends BinaryArithmetic `
      * `case class Multiply(left: Expression, right: Expression) extends BinaryArithmetic `
      * `case class Divide(left: Expression, right: Expression) extends BinaryArithmetic `
      * `case class Remainder(left: Expression, right: Expression) extends BinaryArithmetic `
      * `case class MaxOf(left: Expression, right: Expression) extends Expression `
      * `case class Abs(child: Expression) extends UnaryExpression  `
      * `abstract class CodeGenerator[InType <: AnyRef, OutType <: AnyRef] extends Logging `
      * `  protected case class EvaluatedExpression(`
      * `    implicit class Evaluate1(e: Expression) `
      * `    implicit class Evaluate2(expressions: (Expression, Expression)) `
      * `    val q"class $orderingName extends $orderingType `
      * `      class SpecificOrdering extends Ordering[Row] `
      * `      class $orderingName extends $orderingType `
      * `      final class SpecificRow(i: $rowType) extends $mutableRowType `
      * `case class GetItem(child: Expression, ordinal: Expression) extends Expression `
      * `case class GetField(child: Expression, fieldName: String) extends UnaryExpression `
      * `abstract class Generator extends Expression `
      * `case class Explode(attributeNames: Seq[String], child: Expression)`
      * `case class Literal(value: Any, dataType: DataType) extends LeafExpression `
      * `case class MutableLiteral(var value: Any, dataType: DataType, nullable: Boolean = true) `
      * `case class ExprId(id: Long)`
      * `abstract class NamedExpression extends Expression `
      * `abstract class Attribute extends NamedExpression `
      * `case class Alias(child: Expression, name: String)`
      * `case class AttributeReference(name: String, dataType: DataType, nullable: Boolean = true)`
      * `case class Coalesce(children: Seq[Expression]) extends Expression `
      * `case class IsNull(child: Expression) extends Predicate with trees.UnaryNode[Expression] `
      * `case class IsNotNull(child: Expression) extends Predicate with trees.UnaryNode[Expression] `
      * `  abstract class Projection extends (Row => Row)`
      * `  abstract class MutableProjection extends Projection `
      * `trait Predicate extends Expression `
      * `trait PredicateHelper `
      * `abstract class BinaryPredicate extends BinaryExpression with Predicate `
      * `case class Not(child: Expression) extends UnaryExpression with Predicate `
      * `case class In(value: Expression, list: Seq[Expression]) extends Predicate `
      * `case class And(left: Expression, right: Expression) extends BinaryPredicate `
      * `case class Or(left: Expression, right: Expression) extends BinaryPredicate `
      * `abstract class BinaryComparison extends BinaryPredicate `
      * `case class EqualTo(left: Expression, right: Expression) extends BinaryComparison `
      * `case class EqualNullSafe(left: Expression, right: Expression) extends BinaryComparison `
      * `case class LessThan(left: Expression, right: Expression) extends BinaryComparison `
      * `case class LessThanOrEqual(left: Expression, right: Expression) extends BinaryComparison `
      * `case class GreaterThan(left: Expression, right: Expression) extends BinaryComparison `
      * `case class GreaterThanOrEqual(left: Expression, right: Expression) extends BinaryComparison `
      * `case class If(predicate: Expression, trueValue: Expression, falseValue: Expression)`
      * `case class CaseWhen(branches: Seq[Expression]) extends Expression `
      * `case class NewSet(elementType: DataType) extends LeafExpression `
      * `case class AddItemToSet(item: Expression, set: Expression) extends Expression `
      * `case class CombineSets(left: Expression, right: Expression) extends BinaryExpression `
      * `case class CountSet(child: Expression) extends UnaryExpression `
      * `trait StringRegexExpression `
      * `trait CaseConversionExpression `
      * `case class Like(left: Expression, right: Expression)`
      * `case class RLike(left: Expression, right: Expression)`
      * `case class Upper(child: Expression) extends UnaryExpression with CaseConversionExpression `
      * `case class Lower(child: Expression) extends UnaryExpression with CaseConversionExpression `
      * `trait StringComparison `
      * `case class Contains(left: Expression, right: Expression)`
      * `case class StartsWith(left: Expression, right: Expression)`
      * `case class EndsWith(left: Expression, right: Expression)`
      * `case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression `
      * `abstract class QueryPlanner[PhysicalPlan <: TreeNode[PhysicalPlan]] `
      * `  abstract protected class Strategy extends Logging `
      * `abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanType] `
      * `abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging `
      * `  case class Statistics(`
      * `abstract class LeafNode extends LogicalPlan with trees.LeafNode[LogicalPlan] `
      * `abstract class UnaryNode extends LogicalPlan with trees.UnaryNode[LogicalPlan] `
      * `abstract class BinaryNode extends LogicalPlan with trees.BinaryNode[LogicalPlan] `
      * `case class ScriptTransformation(`
      * `case class LocalRelation(output: Seq[Attribute], data: Seq[Product] = Nil)`
      * `case class Project(projectList: Seq[NamedExpression], child: LogicalPlan) extends UnaryNode `
      * `case class Generate(`
      * `case class Filter(condition: Expression, child: LogicalPlan) extends UnaryNode `
      * `case class Union(left: LogicalPlan, right: LogicalPlan) extends BinaryNode `
      * `case class Join(`
      * `case class Except(left: LogicalPlan, right: LogicalPlan) extends BinaryNode `
      * `case class InsertIntoTable(`
      * `case class CreateTableAsSelect(`
      * `case class WriteToFile(`
      * `case class Sort(order: Seq[SortOrder], child: LogicalPlan) extends UnaryNode `
      * `case class Aggregate(`
      * `case class Limit(limitExpr: Expression, child: LogicalPlan) extends UnaryNode `
      * `case class Subquery(alias: String, child: LogicalPlan) extends UnaryNode `
      * `case class Sample(fraction: Double, withReplacement: Boolean, seed: Long, child: LogicalPlan)`
      * `case class Distinct(child: LogicalPlan) extends UnaryNode `
      * `case class Intersect(left: LogicalPlan, right: LogicalPlan) extends BinaryNode `
      * `abstract class Command extends LeafNode `
      * `case class NativeCommand(cmd: String) extends Command `
      * `case class SetCommand(key: Option[String], value: Option[String]) extends Command `
      * `case class ExplainCommand(plan: LogicalPlan, extended: Boolean = false) extends Command `
      * `case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)`
      * `case class UncacheTableCommand(tableName: String) extends Command`
      * `case class DescribeCommand(`
      * `abstract class RedistributeData extends UnaryNode `
      * `case class SortPartitions(sortExpressions: Seq[SortOrder], child: LogicalPlan)`
      * `case class Repartition(partitionExpressions: Seq[Expression], child: LogicalPlan)`
      * `case class ClusteredDistribution(clustering: Seq[Expression]) extends Distribution `
      * `case class OrderedDistribution(ordering: Seq[SortOrder]) extends Distribution `
      * `sealed trait Partitioning `
      * `case class UnknownPartitioning(numPartitions: Int) extends Partitioning `
      * `case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)`
      * `case class RangePartitioning(ordering: Seq[SortOrder], numPartitions: Int)`
      * `abstract class Rule[TreeType <: TreeNode[_]] extends Logging `
      * `abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging `
      * `  abstract class Strategy `
      * `  case class FixedPoint(maxIterations: Int) extends Strategy`
      * `  protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)`
      * `abstract class TreeNode[BaseType <: TreeNode[BaseType]] `
      * `trait BinaryNode[BaseType <: TreeNode[BaseType]] `
      * `trait LeafNode[BaseType <: TreeNode[BaseType]] `
      * `trait UnaryNode[BaseType <: TreeNode[BaseType]] `
      * `  class TreeNodeRef(val obj: TreeNode[_]) `
      * `abstract class DataType `
      * `trait PrimitiveType extends DataType `
      * `abstract class NativeType extends DataType `
      * `abstract class NumericType extends NativeType with PrimitiveType `
      * `abstract class IntegralType extends NumericType `
      * `abstract class FractionalType extends NumericType `
      * `case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType `
      * `case class StructField(name: String, dataType: DataType, nullable: Boolean) `
      * `case class StructType(fields: Seq[StructField]) extends DataType `
      * `case class MapType(`
      * `  implicit class debugLogging(a: Any) `
      * `public class ArrayType extends DataType `
      * `public class BinaryType extends DataType `
      * `public class BooleanType extends DataType `
      * `public class ByteType extends DataType `
      * `public abstract class DataType `
      * `public class DecimalType extends DataType `
      * `public class DoubleType extends DataType `
      * `public class FloatType extends DataType `
      * `public class IntegerType extends DataType `
      * `public class LongType extends DataType `
      * `public class MapType extends DataType `
      * `public class ShortType extends DataType `
      * `public class StringType extends DataType `
      * `public class StructField `
      * `public class StructType extends DataType `
      * `public class TimestampType extends DataType `
      * `class SQLContext(@transient val sparkContext: SparkContext)`
      * `   *   case class Person(name: String, age: Int)`
      * `  protected[sql] class SparkPlanner extends SparkStrategies `
      * `  protected abstract class QueryExecution `
      * `class SchemaRDD(`
      * `class JavaSQLContext(val sqlContext: SQLContext) extends UDFRegistration `
      * `class JavaSchemaRDD(`
      * `  class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T] `
      * `  class Decoder[T <: NativeType](buffer: ByteBuffer, columnType: NativeColumnType[T])`
      * `  class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T] `
      * `  class Decoder[T <: NativeType](buffer: ByteBuffer, columnType: NativeColumnType[T])`
      * `  class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T] `
      * `  class Decoder[T <: NativeType](buffer: ByteBuffer, columnType: NativeColumnType[T])`
      * `  class Encoder extends compression.Encoder[BooleanType.type] `
      * `  class Decoder(buffer: ByteBuffer) extends compression.Decoder[BooleanType.type] `
      * `  class Encoder extends compression.Encoder[IntegerType.type] `
      * `  class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[IntegerType.type])`
      * `  class Encoder extends compression.Encoder[LongType.type] `
      * `  class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[LongType.type])`
      * `case class Aggregate(`
      * `  case class ComputedAggregate(`
      * `case class Exchange(newPartitioning: Partitioning, child: SparkPlan) extends UnaryNode `
      * `case class LogicalRDD(output: Seq[Attribute], rdd: RDD[Row])(sqlContext: SQLContext)`
      * `case class PhysicalRDD(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode `
      * `case class ExistingRdd(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode `
      * `case class SparkLogicalPlan(alreadyPlanned: SparkPlan)(@transient sqlContext: SQLContext)`
      * `case class Generate(`
      * `case class AggregateEvaluation(`
      * `case class GeneratedAggregate(`
      * `class QueryExecutionException(message: String) extends Exception(message)`
      * `abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable `
      * `  case class CommandStrategy(context: SQLContext) extends Strategy `
      * `case class Project(projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryNode `
      * `case class Filter(condition: Expression, child: SparkPlan) extends UnaryNode `
      * `case class Sample(fraction: Double, withReplacement: Boolean, seed: Long, child: SparkPlan)`
      * `case class Union(children: Seq[SparkPlan]) extends SparkPlan `
      * `case class Limit(limit: Int, child: SparkPlan)`
      * `case class TakeOrdered(limit: Int, sortOrder: Seq[SortOrder], child: SparkPlan) extends UnaryNode `
      * `case class Sort(`
      * `case class Distinct(partial: Boolean, child: SparkPlan) extends UnaryNode `
      * `case class Except(left: SparkPlan, right: SparkPlan) extends BinaryNode `
      * `case class Intersect(left: SparkPlan, right: SparkPlan) extends BinaryNode `
      * `case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan `
      * `trait Command `
      * `case class SetCommand(`
      * `case class ExplainCommand(`
      * `case class CacheTableCommand(`
      * `case class UncacheTableCommand(tableName: String) extends LeafNode with Command `
      * `case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(`
      * `  implicit class DebugQuery(query: SchemaRDD) `
      * `    case class ColumnMetrics(`
      * `trait HashJoin `
      * `case class HashOuterJoin(`
      * `case class ShuffledHashJoin(`
      * `case class LeftSemiJoinHash(`
      * `case class BroadcastHashJoin(`
      * `case class LeftSemiJoinBNL(`
      * `case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends BinaryNode `
      * `case class BroadcastNestedLoopJoin(`
      * `case class EvaluatePython(udf: PythonUDF, child: LogicalPlan) extends logical.UnaryNode `
      * `case class BatchPythonEvaluation(udf: PythonUDF, output: Seq[Attribute], child: SparkPlan)`
      * `case class ParquetTableScan(`
      * `case class InsertIntoParquetTable(`
      * `  protected case class Keyword(str: String)`
      * `class LocalHiveContext(sc: SparkContext) extends HiveContext(sc) `
      * `class HiveContext(sc: SparkContext) extends SQLContext(sc) `
      * `  protected[sql] abstract class QueryExecution extends super.QueryExecution `
      * `  implicit class typeInfoConversions(dt: DataType) `
      * `  implicit class SchemaAttribute(f: FieldSchema) `
      * `  implicit class TransformableNode(n: ASTNode) `
      * `  class ParseException(sql: String, cause: Throwable)`
      * `  class SemanticException(msg: String)`
      * `    implicit class LogicalPlanHacks(s: SchemaRDD) `
      * `    implicit class PhysicalPlanHacks(originalPlan: SparkPlan) `
      * `  case class HiveCommandStrategy(context: HiveContext) extends Strategy `
      * `class HadoopTableReader(`
      * `class TestHiveContext(sc: SparkContext) extends HiveContext(sc) `
      * `  protected[hive] class HiveQLQueryExecution(hql: String) extends this.QueryExecution `
      * `  abstract class QueryExecution extends super.QueryExecution `
      * `  case class TestTable(name: String, commands: (()=>Unit)*)`
      * `  protected[hive] implicit class SqlCmd(sql: String) `
      * `class JavaHiveContext(sparkContext: JavaSparkContext) extends JavaSQLContext(sparkContext) `
      * `case class CreateTableAsSelect(`
      * `case class DescribeHiveTableCommand(`
      * `case class HiveTableScan(`
      * `case class InsertIntoHiveTable(`
      * `    assert(valueClass != null, "Output value class not set")`
      * `    assert(outputFileFormatClassName != null, "Output format class not set")`
      * `case class NativeCommand(`
      * `case class ScriptTransformation(`
      * `case class AnalyzeTable(tableName: String) extends LeafNode with Command `
      * `case class DropTable(tableName: String, ifExists: Boolean) extends LeafNode with Command `
      * `case class AddJar(path: String) extends LeafNode with Command `
      * `  class DeferredObjectAdapter extends DeferredObject `
      * `  protected class UDTFCollector extends Collector `
      * `class FakeParquetSerDe extends SerDe `
      * `case class ParameterizedType(override val name: String,`
      * `case class SparkMethod(name: String, returnType: SparkType, parameters: Seq[SparkType]) `
      * `class ExecutorRunnable(`
      * `                logInfo("Interrupting user class to stop.")`
      * `class ApplicationMasterArguments(val args: Array[String]) `
      * `      "  --class CLASS_NAME         Name of your application's main class (required)
    " +`
      * `trait ExecutorRunnableUtil extends Logging `
      * `  protected trait YarnAllocateResponse `
      * `trait YarnRMClient `
      * `class YarnSparkHadoopUtil extends SparkHadoopUtil `
      * `class ExecutorRunnable(`
      * `class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19129869
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,157 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
    +  extends Logging with Serializable {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
    +   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
    +   * ground truth set is less than k.
    +   *
    +   * If a query has an empty ground truth set, zero will be returned together with a log warning.
    +   *
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision, must be positive
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = {
    +    require (k > 0,"ranking position k should be positive")
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val n = math.min(pred.length, k)
    +      var i = 0
    +      var cnt = 0
    +
    +      while (i < n) {
    +        if (labSet.contains(pred(i))) {
    +          cnt += 1
    +        }
    +        i += 1
    +      }
    +      if (labSet.size == 0) {
    +        logWarning("Empty ground truth set, check input data")
    +        0.0
    +      } else {
    +        cnt.toDouble / k
    +      }
    +    }.mean
    +  }
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries.
    +   * If a query has an empty ground truth set, the average precision will be zero and a log
    +   * warining is generated.
    +   */
    +  lazy val meanAveragePrecision: Double = {
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val labSetSize = labSet.size
    +      var i = 0
    +      var cnt = 0
    +      var precSum = 0.0
    +      val n = pred.length
    +
    +      while (i < n) {
    +        if (labSet.contains(pred(i))) {
    +          cnt += 1
    +          precSum += cnt.toDouble / (i + 1)
    +        }
    +        i += 1
    +      }
    +      if (labSetSize == 0) {
    +        logWarning("Empty ground truth set, check input data")
    +        0.0
    +      } else {
    +        precSum / labSet.size
    +      }
    +    }.mean
    +  }
    +
    +  /**
    +   * Compute the average NDCG value of all the queries, truncated at ranking position k.
    +   * The discounted cumulative gain at position k is computed as:
    +   *    \sum_{i=1}^k (2^{relevance of ith item} - 1) / log(i + 1),
    +   * and the NDCG is obtained by dividing the DCG value on the ground truth set. In the current
    +   * implementation, the relevance value is binary.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at position n
    +   * will be used. If the ground truth set contains n (n < k) results, the first n items will be
    +   * used to compute the DCG value on the ground truth set.
    +   *
    +   * If a query has an empty ground truth set, zero will be returned together with a log warning.
    +   *
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated ndcg, must be positive
    +   * @return the average ndcg at the first k ranking positions
    +   */
    +  def ndcgAt(k: Int): Double = {
    +    require (k > 0,"ranking position k should be positive")
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val labSetSize = labSet.size
    +      val n = math.min(math.max(pred.length, labSetSize), k)
    +      var maxDcg = 0.0
    +      var dcg = 0.0
    +      var i = 0
    +
    +      while (i < n) {
    +        val gain = 1.0 / math.log(i + 2)
    +        if (labSet.contains(pred(i))) {
    +          dcg += gain
    +        }
    +        if (i < labSetSize) {
    +          maxDcg += gain
    +        }
    +        i += 1
    +      }
    +      if (labSetSize == 0) {
    --- End diff --
    
    ditto
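
    As a concrete check of the ndcgAt computation quoted above, here is a small worked
    example (a sketch with made-up data, not part of the patch):
    
    ~~~
    // Predicted ranking: a, b, c ; ground truth set: {a, c} ; k = 3.
    val pred = Array("a", "b", "c")
    val labSet = Set("a", "c")
    val k = 3
    val n = math.min(math.max(pred.length, labSet.size), k)
    var dcg = 0.0
    var maxDcg = 0.0
    var i = 0
    while (i < n) {
      val gain = 1.0 / math.log(i + 2)            // 1/ln(2), 1/ln(3), 1/ln(4), ...
      if (labSet.contains(pred(i))) dcg += gain   // hits at positions 1 and 3
      if (i < labSet.size) maxDcg += gain         // ideal ranking puts both hits first
      i += 1
    }
    println(dcg / maxDcg)                         // ~ 0.92
    ~~~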


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59793960
  
    Jenkins, add to whitelist.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19046948
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results,
    +   * the precision value will be computed as #(relevant items retrived) / k.
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val n = math.min(pred.length, k)
    +    var i = 0
    +    var cnt = 0
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      i += 1
    +    }
    +    cnt.toDouble / k
    +  }.mean
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    var i = 0
    +    var cnt = 0
    +    var precSum = 0.0
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    --- End diff --
    
    `labSet` could be empty.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19129850
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,157 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
    +  extends Logging with Serializable {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
    +   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
    --- End diff --
    
    `retrived` -> `retrieved`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59976043
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21998/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18875995
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Returns the precsion@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    +    val labSet = lab.toSet
    +    val n = pred.length
    +    val topKPrec = Array.fill[Double](n)(0.0)
    +    var (i, cnt) = (0, 0)
    --- End diff --
    
    ~~~
    var i = 0
    var cnt = 0
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18565240
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    --- End diff --
    
    ... and if the code is just going to turn the arguments into `Set` then it need not be an `RDD` of `Array`, but something more generic?
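
    A sketch of the more generic signature hinted at here (illustrative only; the class
    name, the `Seq`/`Set` parameter types, and the body are hypothetical, not what the
    patch adopts). The prediction only needs to be ordered and the ground truth only
    needs membership tests:
    
    ~~~
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    
    class GenericRankingMetrics[T](predictionAndLabels: RDD[(Seq[T], Set[T])])
      extends Serializable {
    
      // Same precision@k computation as in the diff, written against the generic types.
      def precisionAt(k: Int): Double = {
        require(k > 0, "ranking position k should be positive")
        predictionAndLabels.map { case (pred, labSet) =>
          if (labSet.nonEmpty) pred.take(k).count(labSet.contains).toDouble / k else 0.0
        }.mean()
      }
    }
    ~~~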


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18447346
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
    +
    +  /**
    +   * Returns the precsion@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    +    val labSet : Set[Double] = lab.toSet
    --- End diff --
    
    Given my previous comment, maybe I'm missing something, but isn't one of the two arguments always going to be 1 to n? Either you are ranking the predicted top n versus real rankings, or evaluating the predicted ranking of the known top n...? I would have expected the input to be the predicted top-n items by ID, plus the IDs of the real top n; then making a set and using `contains` makes sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59984488
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21997/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2667


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18876004
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Returns the precsion@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    +    val labSet = lab.toSet
    +    val n = pred.length
    +    val topKPrec = Array.fill[Double](n)(0.0)
    +    var (i, cnt) = (0, 0)
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      topKPrec(i) = cnt.toDouble / (i + 1)
    +      i += 1
    +    }
    +    topKPrec
    +  }
    +
    +  /**
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precision(k: Int): Double = precAtK.map {topKPrec =>
    +    val n = topKPrec.length
    +    if (k <= n) {
    +      topKPrec(k - 1)
    +    } else {
    +      topKPrec(n - 1) * n / k
    +    }
    +  }.mean
    +
    +  /**
    +   * Returns the average precision for each query
    +   */
    +  lazy val avePrec: RDD[Double] = predictionAndLabels.map {case (pred, lab) =>
    +    val labSet = lab.toSet
    +    var (i, cnt, precSum) = (0, 0, 0.0)
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    +  }
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAvePrec: Double = avePrec.mean
    +
    +  /**
    +   * Returns the normalized discounted cumulative gain for each query
    +   */
    +  lazy val ndcgAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val labSetSize = labSet.size
    +    val n = math.max(pred.length, labSetSize)
    +    val topKNdcg = Array.fill[Double](n)(0.0)
    +    var (maxDcg, dcg, i) = (0.0, 0.0, 0)
    +    while (i < n) {
    +      /* Calculate 1/log2(i + 2) */
    +      val gain = math.log(2) / math.log(i + 2)
    +      if (labSet.contains(pred(i))) {
    +        dcg += gain
    +      }
    +      if (i < labSetSize) {
    +        maxDcg += gain
    +      }
    +      topKNdcg(i) = dcg / maxDcg
    +      i += 1
    +    }
    +    topKNdcg
    +  }
    +
    +  /**
    +   * @param k the position to compute the truncated ndcg
    --- End diff --
    
    Need a one sentence summary about the method. Also need a link to the paper that describes how to handle the case when we don't have k items.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58274726
  
    @coderxiang `k` should be a configurable parameter in ranking metrics, e.g., `prec@1` and `prec@5`. The API could be
    
    ~~~
    def precision(k: Int): Double
    def ndcg(k: Int): Double
    ~~~
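
    For illustration, a minimal usage sketch with the k-parameterized API (the method
    names eventually used in the patch are `precisionAt` and `ndcgAt`; the item IDs and
    data below are made up):
    
    ~~~
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.evaluation.RankingMetrics
    
    val sc = new SparkContext("local", "ranking-metrics-example")
    // One (predicted ranking, ground truth set) pair per query.
    val predictionAndLabels = sc.parallelize(Seq(
      (Array("d1", "d6", "d2", "d7", "d3"), Array("d1", "d2", "d3")),
      (Array("d4", "d1", "d5"),             Array("d1", "d2"))
    ))
    val metrics = new RankingMetrics(predictionAndLabels)
    println(metrics.precisionAt(1))          // prec@1
    println(metrics.precisionAt(5))          // prec@5
    println(metrics.ndcgAt(5))               // ndcg@5
    println(metrics.meanAveragePrecision)    // MAP
    ~~~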


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18469803
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    --- End diff --
    
    Yes, I get it. But these could just as easily be Strings, like document IDs, or anything you can put into a set and look up. Could this even be generified to accept any type? At least, `Double` seems like the least likely type for an ID; it's not even what MLlib overloads to mean "ID" (for example, ALS assumes `Int` IDs).
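
    To make the Int-ID case concrete, a hypothetical sketch of evaluating ALS-style
    recommendations once the class is generified (the RDD contents below are invented;
    `join` pairs each user's recommended items with that user's relevant items):
    
    ~~~
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.evaluation.RankingMetrics
    
    val sc = new SparkContext("local", "ranking-metrics-int-ids")
    val recommended = sc.parallelize(Seq(1 -> Array(10, 20, 30), 2 -> Array(40, 50)))
    val relevant    = sc.parallelize(Seq(1 -> Array(20, 60),     2 -> Array(40)))
    // (userId, (recommended item IDs, relevant item IDs)) -> drop the user key.
    val predictionAndLabels = recommended.join(relevant).values
    val metrics = new RankingMetrics(predictionAndLabels)
    println(metrics.meanAveragePrecision)
    ~~~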


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by coderxiang <gi...@git.apache.org>.
Github user coderxiang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18467711
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    --- End diff --
    
    The current implementation only considers binary relevance, meaning the input is just labels instead of scores. It is true that `Int` is enough. IMHO, using `Double` is compatible with the current mllib setting and could be extended to deal with the score setting. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by coderxiang <gi...@git.apache.org>.
Github user coderxiang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18467987
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
    --- End diff --
    
    I think I should check non-empty. It is not necessary that these two have the same length; the first array contains all the items we believe to be relevant, which could be more or fewer than the ground truth set.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19129857
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,157 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
    +  extends Logging with Serializable {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
    +   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
    +   * ground truth set is less than k.
    +   *
    +   * If a query has an empty ground truth set, zero will be returned together with a log warning.
    +   *
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision, must be positive
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = {
    +    require (k > 0,"ranking position k should be positive")
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val n = math.min(pred.length, k)
    +      var i = 0
    +      var cnt = 0
    +
    +      while (i < n) {
    +        if (labSet.contains(pred(i))) {
    +          cnt += 1
    +        }
    +        i += 1
    +      }
    +      if (labSet.size == 0) {
    --- End diff --
    
    If `labSet` is empty, the `while` loop is wasted.
    
    ~~~
    if (labelSet.nonEmpty) {
      val n = math.min(...)
      ...
      cnt.toDouble / k
    } else {
      logWarning("...")
      0.0
    }
    ~~~
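
    Spelled out, the suggested restructuring would look roughly like this (a sketch that
    combines the diff above with the proposed guard; not the final patch):
    
    ~~~
    def precisionAt(k: Int): Double = {
      require(k > 0, "ranking position k should be positive")
      predictionAndLabels.map { case (pred, lab) =>
        val labSet = lab.toSet
        if (labSet.nonEmpty) {
          val n = math.min(pred.length, k)
          var i = 0
          var cnt = 0
          while (i < n) {
            if (labSet.contains(pred(i))) {
              cnt += 1
            }
            i += 1
          }
          cnt.toDouble / k
        } else {
          logWarning("Empty ground truth set, check input data")
          0.0
        }
      }.mean()
    }
    ~~~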


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58708698
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21600/consoleFull) for   PR 2667 at commit [`5f87bce`](https://github.com/apache/spark/commit/5f87bce822af928c185a9a3fff28b004b138f955).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-57979727
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21318/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18876000
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Returns the precsion@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    +    val labSet = lab.toSet
    +    val n = pred.length
    +    val topKPrec = Array.fill[Double](n)(0.0)
    +    var (i, cnt) = (0, 0)
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      topKPrec(i) = cnt.toDouble / (i + 1)
    +      i += 1
    +    }
    +    topKPrec
    +  }
    +
    +  /**
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precision(k: Int): Double = precAtK.map {topKPrec =>
    +    val n = topKPrec.length
    +    if (k <= n) {
    +      topKPrec(k - 1)
    +    } else {
    +      topKPrec(n - 1) * n / k
    +    }
    +  }.mean
    +
    +  /**
    +   * Returns the average precision for each query
    +   */
    +  lazy val avePrec: RDD[Double] = predictionAndLabels.map {case (pred, lab) =>
    +    val labSet = lab.toSet
    +    var (i, cnt, precSum) = (0, 0, 0.0)
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    +  }
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAvePrec: Double = avePrec.mean
    --- End diff --
    
    `meanAvePrec` is not a common acronym for MAP. Let's use `meanAveragePrecision`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59794882
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21920/consoleFull) for   PR 2667 at commit [`d64c120`](https://github.com/apache/spark/commit/d64c1201e439d2894a76196659c59b9abb03be5e).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58701203
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21600/consoleFull) for   PR 2667 at commit [`5f87bce`](https://github.com/apache/spark/commit/5f87bce822af928c185a9a3fff28b004b138f955).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19129859
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,157 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
    +  extends Logging with Serializable {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   *
    +   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
    +   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
    +   * ground truth set is less than k.
    +   *
    +   * If a query has an empty ground truth set, zero will be returned together with a log warning.
    +   *
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision, must be positive
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = {
    +    require (k > 0,"ranking position k should be positive")
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val n = math.min(pred.length, k)
    +      var i = 0
    +      var cnt = 0
    +
    +      while (i < n) {
    +        if (labSet.contains(pred(i))) {
    +          cnt += 1
    +        }
    +        i += 1
    +      }
    +      if (labSet.size == 0) {
    +        logWarning("Empty ground truth set, check input data")
    +        0.0
    +      } else {
    +        cnt.toDouble / k
    +      }
    +    }.mean
    +  }
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries.
    +   * If a query has an empty ground truth set, the average precision will be zero and a log
    +   * warining is generated.
    +   */
    +  lazy val meanAveragePrecision: Double = {
    +    predictionAndLabels.map { case (pred, lab) =>
    +      val labSet = lab.toSet
    +      val labSetSize = labSet.size
    +      var i = 0
    +      var cnt = 0
    +      var precSum = 0.0
    +      val n = pred.length
    +
    +      while (i < n) {
    +        if (labSet.contains(pred(i))) {
    +          cnt += 1
    +          precSum += cnt.toDouble / (i + 1)
    +        }
    +        i += 1
    +      }
    +      if (labSetSize == 0) {
    --- End diff --
    
    ditto (do not go through the while loop if labSet is empty)
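
    For reference, a minimal sketch of that change (illustrative only, not the exact diff):
    
    ```
    // Sketch: check the ground truth set first, so empty queries skip the scan entirely.
    if (labSet.isEmpty) {
      logWarning("Empty ground truth set, check input data")
      0.0
    } else {
      var i = 0
      var cnt = 0
      var precSum = 0.0
      val n = pred.length
      while (i < n) {
        if (labSet.contains(pred(i))) {
          cnt += 1
          precSum += cnt.toDouble / (i + 1)
        }
        i += 1
      }
      precSum / labSet.size
    }
    ```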




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59464729
  
    test this please




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18447409
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
    +
    +  /**
    +   * Returns the precision@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    +    val labSet : Set[Double] = lab.toSet
    +    val n = pred.length
    +    val topkPrec = Array.fill[Double](n)(.0)
    +    var (i, cnt) = (0, 0)
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      topkPrec(i) = cnt.toDouble / (i + 1)
    +      i += 1
    +    }
    +    topkPrec
    +  }
    +
    +  /**
    +   * Returns the average precision for each query
    +   */
    +  lazy val avePrec: RDD[Double] = predictionAndLabels.map {case (pred, lab) =>
    +    val labSet: Set[Double] = lab.toSet
    +    var (i, cnt, precSum) = (0, 0, .0)
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    +  }
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAvePrec: Double = computeMean(avePrec)
    +
    +  /**
    +   * Returns the normalized discounted cumulative gain for each query
    +   */
    +  lazy val ndcg: RDD[Double] = predictionAndLabels.map {case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val n = math.min(pred.length, labSet.size)
    +    var (maxDcg, dcg, i) = (.0, .0, 0)
    +    while (i < n) {
    +      /* Calculate 1/log2(i + 2) */
    +      val gain = 1.0 / (math.log(i + 2) / math.log(2))
    +      if (labSet.contains(pred(i))) {
    +        dcg += gain
    +      }
    +      maxDcg += gain
    +      i += 1
    +    }
    +    dcg / maxDcg
    +  }
    +
    +  /**
    +   * Returns the mean NDCG of all the queries
    +   */
    +  lazy val meanNdcg: Double = computeMean(ndcg)
    +
    +  private def computeMean(data: RDD[Double]): Double = {
    --- End diff --
    
    `RDD[Double]` already has a `mean()` method; no need to reimplement.
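
    For example, something like this should do (assuming the existing `import org.apache.spark.SparkContext._` stays, which brings in the `DoubleRDDFunctions` implicit that provides `mean()`):
    
    ```
    // mean() is provided by DoubleRDDFunctions, so no custom helper is needed.
    lazy val meanAvePrec: Double = avePrec.mean()
    
    lazy val meanNdcg: Double = ndcg.mean()
    ```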




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-60010387
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22009/consoleFull) for   PR 2667 at commit [`d881097`](https://github.com/apache/spark/commit/d88109767c1ffd853b2b03c33c8d722353fd39ce).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])`





[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58056900
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21333/
    Test PASSed.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19183393
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    --- End diff --
    
    @srowen We can optimize the implementation later, if it is worth doing, and using Array may help. Usually, both predictions and labels are small. Scanning it sequentially may be faster than `toSeq` and `contains`. Array is also consistent across Java, Scala, and Python, and storage efficient for primitive types.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r18875993
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Returns the precision@k for each query
    +   */
    +  lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
    +    val labSet = lab.toSet
    +    val n = pred.length
    +    val topKPrec = Array.fill[Double](n)(0.0)
    --- End diff --
    
    new Array[Double](n)
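
    i.e. a sketch of the suggestion: a JVM `Double` array is zero-initialized on allocation, so the fill closure buys nothing here.
    
    ```
    val topKPrec = new Array[Double](n)
    ```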




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19046949
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results,
    +   * the precision value will be computed as #(relevant items retrieved) / k.
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val n = math.min(pred.length, k)
    +    var i = 0
    +    var cnt = 0
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      i += 1
    +    }
    +    cnt.toDouble / k
    +  }.mean
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    var i = 0
    +    var cnt = 0
    +    var precSum = 0.0
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    +  }.mean
    +
    +  /**
    +   * Compute the average NDCG value of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at
    +   * position n will be used. See the following paper for detail:
    --- End diff --
    
    ditto: what if label set contains less than k items? Need doc.
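
    As an illustration of the wording being asked for (not the final doc), something like:
    
    ```
    /**
     * If the ground truth set has fewer than k items, the ideal DCG in the denominator
     * only accumulates over those items, so a ranking that places all of them first
     * still reaches an NDCG of 1.0.
     */
    ```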




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58045965
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21333/consoleFull) for   PR 2667 at commit [`e443fee`](https://github.com/apache/spark/commit/e443feef8321dfc26eb5630f4d3b3dc0c6b79b98).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19047629
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results,
    +   * the precision value will be computed as #(relevant items retrieved) / k.
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val n = math.min(pred.length, k)
    +    var i = 0
    +    var cnt = 0
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      i += 1
    +    }
    +    cnt.toDouble / k
    +  }.mean
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    var i = 0
    +    var cnt = 0
    +    var precSum = 0.0
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    +  }.mean
    +
    +  /**
    +   * Compute the average NDCG value of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at
    +   * position n will be used. See the following paper for detail:
    --- End diff --
    
    Btw, there are variants of NDCG definition. We need to say in the doc which version we implement.
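
    For reference, the form the quoted code appears to compute is the binary-relevance one, where rel_i is 1 if the item at rank i is in the ground truth set and 0 otherwise:
    
    ```
    DCG@k  = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}
    NDCG@k = \frac{DCG@k}{IDCG@k}
    ```
    
    with IDCG@k the DCG of an ideal ordering of the ground truth set, which is the detail worth spelling out in the doc.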




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19046956
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results,
    +   * the precision value will be computed as #(relevant items retrieved) / k.
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val n = math.min(pred.length, k)
    +    var i = 0
    +    var cnt = 0
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      i += 1
    +    }
    +    cnt.toDouble / k
    +  }.mean
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    var i = 0
    +    var cnt = 0
    +    var precSum = 0.0
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    +  }.mean
    +
    +  /**
    +   * Compute the average NDCG value of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at
    +   * position n will be used. See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated ndcg
    +   * @return the average ndcg at the first k ranking positions
    +   */
    +  def ndcgAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val labSetSize = labSet.size
    +    val n = math.min(math.max(pred.length, labSetSize), k)
    +    var maxDcg = 0.0
    +    var dcg = 0.0
    +    var i = 0
    +
    +    while (i < n) {
    +      // Calculate 1/log2(i + 2)
    +      val gain = math.log(2) / math.log(i + 2)
    +      if (labSet.contains(pred(i))) {
    +        dcg += gain
    +      }
    +      if (i < labSetSize) {
    +        maxDcg += gain
    +      }
    +      i += 1
    +    }
    +    dcg / maxDcg
    --- End diff --
    
    `maxDcg` could be zero. Please add a test for this corner case.
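
    A sketch of such a test, assuming a `SparkContext` named `sc` as in the other evaluation suites (the data is only illustrative):
    
    ```
    // A query with an empty ground truth set drives maxDcg to zero;
    // without a guard, dcg / maxDcg evaluates to 0.0 / 0.0 = NaN.
    val metrics = new RankingMetrics(sc.parallelize(
      Seq((Array(1, 2, 3), Array.empty[Int])), 2))
    assert(!metrics.ndcgAt(3).isNaN)
    ```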




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-57976456
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21318/consoleFull) for   PR 2667 at commit [`3a5a6ff`](https://github.com/apache/spark/commit/3a5a6ffdb036f8432911184920193a4b8a007084).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-59810520
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21921/
    Test PASSed.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/2667#issuecomment-58334664
  
    @mengxr Yes I understand these metrics. Precision / recall are binary classifier metrics at heart (but not nDCG for example). Precision@k needs ranking. That's why precision/recall turn up in `MulticlassMetrics` as a special case -- you can compute precision of class-_i_ vs not-class-_i_. 
    
    I understand it is two views on the same metric, and these are collections of different metrics. Hm, maybe this makes more sense if the metrics here are clearly precision@k, recall@k, with an argument _k_. The other class has F-measure, true positive rate, etc. I suppose those could be implemented here too for consistency, but don't know how useful they are. 
    
    Meh, anything to rationalize the purposeful difference between these two would help. Maybe it's just me.




[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2667#discussion_r19046954
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * ::Experimental::
    + * Evaluator for ranking algorithms.
    + *
    + * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
    + */
    +@Experimental
    +class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
    +
    +  /**
    +   * Compute the average precision of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results,
    +   * the precision value will be computed as #(relevant items retrieved) / k.
    +   * See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated precision
    +   * @return the average precision at the first k ranking positions
    +   */
    +  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val n = math.min(pred.length, k)
    +    var i = 0
    +    var cnt = 0
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +      }
    +      i += 1
    +    }
    +    cnt.toDouble / k
    +  }.mean
    +
    +  /**
    +   * Returns the mean average precision (MAP) of all the queries
    +   */
    +  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    var i = 0
    +    var cnt = 0
    +    var precSum = 0.0
    +    val n = pred.length
    +
    +    while (i < n) {
    +      if (labSet.contains(pred(i))) {
    +        cnt += 1
    +        precSum += cnt.toDouble / (i + 1)
    +      }
    +      i += 1
    +    }
    +    precSum / labSet.size
    +  }.mean
    +
    +  /**
    +   * Compute the average NDCG value of all the queries, truncated at ranking position k.
    +   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at
    +   * position n will be used. See the following paper for detail:
    +   *
    +   * IR evaluation methods for retrieving highly relevant documents.
    +   *    K. Jarvelin and J. Kekalainen
    +   *
    +   * @param k the position to compute the truncated ndcg
    +   * @return the average ndcg at the first k ranking positions
    +   */
    +  def ndcgAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) =>
    +    val labSet = lab.toSet
    +    val labSetSize = labSet.size
    +    val n = math.min(math.max(pred.length, labSetSize), k)
    +    var maxDcg = 0.0
    +    var dcg = 0.0
    +    var i = 0
    +
    +    while (i < n) {
    +      // Calculate 1/log2(i + 2)
    --- End diff --
    
    the comment doesn't provide any extra information

