Posted to user@spark.apache.org by Jong Wook Kim <jo...@nyu.edu> on 2016/09/19 03:42:25 UTC

Is RankingMetrics' NDCG implementation correct?

Hi,

I'm trying to evaluate a recommendation model and found that Spark and
Rival <http://dl.acm.org/citation.cfm?id=2645712> give different results,
and it seems that Rival's is what Kaggle defines
<https://www.kaggle.com/wiki/NormalizedDiscountedCumulativeGain>:
https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e597

Am I using RankingMetrics the wrong way, or is Spark's implementation
incorrect?

To my knowledge, NDCG should depend on the relevance (or preference)
values, but Spark's implementation
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L129-L156>
does not seem to; it uses a gain of 1.0 where it should be
2^(relevance) - 1, presumably assuming that every relevance is 1.0. I also
tried tweaking it, but its method of computing the ideal DCG also looks
wrong.
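
For reference, below is a minimal, self-contained Scala sketch of the
graded-relevance NDCG I have in mind (it is not Spark's code; the String
item IDs and the Map-based relevance are just for illustration). It uses
the gain 2^(relevance) - 1 and computes the ideal DCG from the relevances
sorted in descending order:

object GradedNdcgSketch {
  /** NDCG@k: `ranked` is the predicted ordering, `relevance` maps
    * item -> graded relevance; items missing from the map count as 0. */
  def ndcgAt(ranked: Seq[String], relevance: Map[String, Double], k: Int): Double = {
    def gain(rel: Double): Double = math.pow(2.0, rel) - 1.0
    // positions are 0-based, so the discount at position i is 1 / log2(i + 2)
    def discount(i: Int): Double = math.log(2.0) / math.log(i + 2.0)

    val dcg = ranked.take(k).zipWithIndex.map { case (item, i) =>
      gain(relevance.getOrElse(item, 0.0)) * discount(i)
    }.sum

    // Ideal DCG: the same gains and discounts, applied to the relevances
    // sorted in descending order (the best possible ranking).
    val idcg = relevance.values.toSeq.sortBy(-_).take(k).zipWithIndex.map {
      case (rel, i) => gain(rel) * discount(i)
    }.sum

    if (idcg == 0.0) 0.0 else dcg / idcg
  }
}

When every relevance is 1.0 the gain reduces to 1, which is effectively
what Spark assumes.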

Any feedback from MLlib developers would be appreciated. I made a
modified/extended version of RankingMetrics that produces numbers
identical to Kaggle's and Rival's results, and I'm wondering whether it
would be appropriate to contribute back to MLlib.

Jong Wook

Re: Is RankingMetrics' NDCG implementation correct?

Posted by Nick Pentreath <ni...@gmail.com>.
(cc'ing dev list also)

I think a more general version of the ranking metrics that allows
arbitrary relevance scores could be useful. Ranking metrics are applicable
to other settings such as search and other learning-to-rank use cases, so
it should be a little more generic than the pure recommender setting.

The one issue with the proposed implementation is that it is not compatible
with the existing cross-validators within a pipeline.

As I've mentioned on the linked JIRAs & PRs, one option is to create a
special set of cross-validators for recommenders that address the issues
of (a) dataset splitting specific to recommender settings (user-based
stratified sampling, time-based splits, etc.) and (b) ranking-based
evaluation.
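
As a rough illustration of (a), a per-user, time-based holdout could look
something like the sketch below. This is not an existing Spark API, just
plain DataFrame operations over an interactions table with made-up column
names ("user", "item", "timestamp"):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hold out the most recent `k` interactions of each user as the test set
// and keep the rest for training.
def timeBasedSplit(interactions: DataFrame, k: Int): (DataFrame, DataFrame) = {
  val byRecency = Window.partitionBy("user").orderBy(col("timestamp").desc)
  val ranked = interactions.withColumn("rank", row_number().over(byRecency))
  val test = ranked.filter(col("rank") <= k).drop("rank")
  val train = ranked.filter(col("rank") > k).drop("rank")
  (train, test)
}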

The other option is to make ALSModel itself capable of generating the
"ground-truth" set within the same DataFrame output from "transform" (i.e.
predict top k), so that it can be fed into the cross-validator (with a
RankingEvaluator) directly. That's the approach I have taken so far in
https://github.com/apache/spark/pull/12574.
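
To illustrate the shape I mean for (b), the sketch below (spark-shell
style, hypothetical column names, plain DataFrame operations rather than
any existing evaluator API) produces one row per user holding both the
predicted top-k items and the held-out ground-truth items, which is what
a RankingEvaluator would consume:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().appName("ranking-eval-shape").master("local[*]").getOrCreate()
import spark.implicits._

// Pretend these came from the model's top-k predictions and a held-out split.
val topK = Seq((1, Seq(10, 11, 12)), (2, Seq(20, 21, 22))).toDF("user", "predictedItems")
val heldOut = Seq((1, 11), (1, 13), (2, 20)).toDF("user", "item")

// Collapse the held-out interactions into one ground-truth list per user.
val groundTruth = heldOut.groupBy("user").agg(collect_list("item").as("actualItems"))

// One DataFrame with both columns, i.e. the shape an evaluator would need.
val evalInput = topK.join(groundTruth, "user")
evalInput.show(false)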

Both options are valid and have their positives & negatives - open to
comments / suggestions.

On Tue, 20 Sep 2016 at 06:08 Jong Wook Kim <jo...@nyu.edu> wrote:

> Thanks for the clarification and the relevant links. I overlooked the
> comments explicitly saying that the relevance is binary.
>
> I understand that the label is not a relevance, but I have been, and I
> think many people are using the label as relevance in the implicit feedback
> context where the user-provided exact label is not defined anyway. I think
> that's why RiVal <https://github.com/recommenders/rival>'s using the term
> "preference" for both the label for MAE and the relevance for NDCG.
>
> At the same time, I see why Spark decided to assume the relevance is
> binary, in part to conform to the class RankingMetrics's constructor. I
> think it would be nice if the upcoming DataFrame-based RankingEvaluator can
> be optionally set a "relevance column" that has non-binary relevance
> values, otherwise defaulting to either 1.0 or the label column.
>
> My extended version of RankingMetrics is here:
> https://github.com/jongwook/spark-ranking-metrics . It has a test case
> checking that the numbers are same as RiVal's.
>
> Jong Wook
>
>
>
> On 19 September 2016 at 03:13, Sean Owen <so...@cloudera.com> wrote:
>
>> Yes, relevance is always 1. The label is not a relevance score so
>> don't think it's valid to use it as such.
>>
>> On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim <jo...@nyu.edu> wrote:
>> > Hi,
>> >
>> > I'm trying to evaluate a recommendation model, and found that Spark and
>> > Rival give different results, and it seems that Rival's one is what
>> Kaggle
>> > defines:
>> https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e597
>> >
>> > Am I using RankingMetrics in a wrong way, or is Spark's implementation
>> > incorrect?
>> >
>> > To my knowledge, NDCG should be dependent on the relevance (or
>> preference)
>> > values, but Spark's implementation seems not; it uses 1.0 where it
>> should be
>> > 2^(relevance) - 1, probably assuming that relevance is all 1.0? I also
>> tried
>> > tweaking, but its method to obtain the ideal DCG also seems wrong.
>> >
>> > Any feedback from MLlib developers would be appreciated. I made a
>> > modified/extended version of RankingMetrics that produces the identical
>> > numbers to Kaggle and Rival's results, and I'm wondering if it is
>> something
>> > appropriate to be added back to MLlib.
>> >
>> > Jong Wook
>>
>
>

Re: Is RankingMetrics' NDCG implementation correct?

Posted by Jong Wook Kim <jo...@nyu.edu>.
Thanks for the clarification and the relevant links. I overlooked the
comments explicitly saying that the relevance is binary.

I understand that the label is not a relevance score, but I have been, and
I think many people are, using the label as the relevance in the implicit
feedback context, where an exact user-provided label is not defined
anyway. I think that's why RiVal <https://github.com/recommenders/rival>
uses the term "preference" for both the label for MAE and the relevance
for NDCG.

At the same time, I see why Spark decided to assume that the relevance is
binary, in part to conform to RankingMetrics's constructor. I think it
would be nice if the upcoming DataFrame-based RankingEvaluator could
optionally be given a "relevance column" holding non-binary relevance
values, otherwise defaulting to either 1.0 or the label column.
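
To make that concrete, below is a purely hypothetical sketch of such a
parameter using the existing ml.param machinery. "relevanceCol" is not an
existing Spark parameter, and the fallback behaviour described in its doc
string is just what I'm proposing:

import org.apache.spark.ml.param.{Param, Params}

// Hypothetical shared-param trait that a DataFrame-based RankingEvaluator
// could mix in to accept graded relevance values.
trait HasRelevanceCol extends Params {
  final val relevanceCol: Param[String] = new Param[String](
    this, "relevanceCol",
    "column with graded relevance values; when unset, fall back to the label column or 1.0")

  final def getRelevanceCol: String = $(relevanceCol)
}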

My extended version of RankingMetrics is here:
https://github.com/jongwook/spark-ranking-metrics . It has a test case
checking that the numbers are the same as RiVal's.

Jong Wook



On 19 September 2016 at 03:13, Sean Owen <so...@cloudera.com> wrote:

> Yes, relevance is always 1. The label is not a relevance score so
> don't think it's valid to use it as such.
>
> On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim <jo...@nyu.edu> wrote:
> > Hi,
> >
> > I'm trying to evaluate a recommendation model, and found that Spark and
> > Rival give different results, and it seems that Rival's one is what
> Kaggle
> > defines: https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e5
> 97
> >
> > Am I using RankingMetrics in a wrong way, or is Spark's implementation
> > incorrect?
> >
> > To my knowledge, NDCG should be dependent on the relevance (or
> preference)
> > values, but Spark's implementation seems not; it uses 1.0 where it
> should be
> > 2^(relevance) - 1, probably assuming that relevance is all 1.0? I also
> tried
> > tweaking, but its method to obtain the ideal DCG also seems wrong.
> >
> > Any feedback from MLlib developers would be appreciated. I made a
> > modified/extended version of RankingMetrics that produces the identical
> > numbers to Kaggle and Rival's results, and I'm wondering if it is
> something
> > appropriate to be added back to MLlib.
> >
> > Jong Wook
>

Re: Is RankingMetrics' NDCG implementation correct?

Posted by Sean Owen <so...@cloudera.com>.
Yes, relevance is always 1. The label is not a relevance score, so I
don't think it's valid to use it as such.
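
To be concrete, with the current RDD-based RankingMetrics each record is
just (predicted ranking, relevant items), so only set membership is used.
A minimal usage sketch with made-up item IDs:

import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ndcg-binary-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Each record: (items in predicted order, ground-truth items). No ratings
// or relevance values appear anywhere, so every hit has gain 1.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 2, 3, 4, 5), Array(1, 3, 7)),
  (Array(6, 7, 8), Array(8))
))

val metrics = new RankingMetrics(predictionAndLabels)
println(metrics.ndcgAt(5))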

On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim <jo...@nyu.edu> wrote:
> Hi,
>
> I'm trying to evaluate a recommendation model, and found that Spark and
> Rival give different results, and it seems that Rival's one is what Kaggle
> defines: https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e597
>
> Am I using RankingMetrics in a wrong way, or is Spark's implementation
> incorrect?
>
> To my knowledge, NDCG should be dependent on the relevance (or preference)
> values, but Spark's implementation seems not; it uses 1.0 where it should be
> 2^(relevance) - 1, probably assuming that relevance is all 1.0? I also tried
> tweaking, but its method to obtain the ideal DCG also seems wrong.
>
> Any feedback from MLlib developers would be appreciated. I made a
> modified/extended version of RankingMetrics that produces the identical
> numbers to Kaggle and Rival's results, and I'm wondering if it is something
> appropriate to be added back to MLlib.
>
> Jong Wook

Re: Is RankingMetrics' NDCG implementation correct?

Posted by Nick Pentreath <ni...@gmail.com>.
The PR for adding a RankingEvaluator to ML already exists:
https://github.com/apache/spark/pull/12461. I need to revive and review
it. DB, your review would be welcome too (and also on
https://github.com/apache/spark/issues/12574, which has implications for
the semantics of ranking metrics in the DataFrame-style API).

Also see the discussion at
https://github.com/apache/spark/pull/12461#discussion-diff-60469791;
comments are welcome.

N

On Mon, 19 Sep 2016 at 06:37 DB Tsai <db...@dbtsai.com> wrote:

> Hi Jong,
>
> I think the definition from Kaggle is correct. I'm working on
> implementing ranking metrics in Spark ML now, but the timeline is
> unknown. Feel free to submit a PR for this in MLlib.
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Sun, Sep 18, 2016 at 8:42 PM, Jong Wook Kim <jo...@nyu.edu> wrote:
> > Hi,
> >
> > I'm trying to evaluate a recommendation model, and found that Spark and
> > Rival give different results, and it seems that Rival's one is what
> Kaggle
> > defines:
> https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e597
> >
> > Am I using RankingMetrics in a wrong way, or is Spark's implementation
> > incorrect?
> >
> > To my knowledge, NDCG should be dependent on the relevance (or
> preference)
> > values, but Spark's implementation seems not; it uses 1.0 where it
> should be
> > 2^(relevance) - 1, probably assuming that relevance is all 1.0? I also
> tried
> > tweaking, but its method to obtain the ideal DCG also seems wrong.
> >
> > Any feedback from MLlib developers would be appreciated. I made a
> > modified/extended version of RankingMetrics that produces the identical
> > numbers to Kaggle and Rival's results, and I'm wondering if it is
> something
> > appropriate to be added back to MLlib.
> >
> > Jong Wook
>

Re: Is RankingMetrics' NDCG implementation correct?

Posted by DB Tsai <db...@dbtsai.com>.
Hi Jong,

I think the definition from Kaggle is correct. I'm working on
implementing ranking metrics in Spark ML now, but the timeline is
unknown. Feel free to submit a PR for this in MLlib.

Thanks.

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Sep 18, 2016 at 8:42 PM, Jong Wook Kim <jo...@nyu.edu> wrote:
> Hi,
>
> I'm trying to evaluate a recommendation model, and found that Spark and
> Rival give different results, and it seems that Rival's one is what Kaggle
> defines: https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e597
>
> Am I using RankingMetrics in a wrong way, or is Spark's implementation
> incorrect?
>
> To my knowledge, NDCG should be dependent on the relevance (or preference)
> values, but Spark's implementation seems not; it uses 1.0 where it should be
> 2^(relevance) - 1, probably assuming that relevance is all 1.0? I also tried
> tweaking, but its method to obtain the ideal DCG also seems wrong.
>
> Any feedback from MLlib developers would be appreciated. I made a
> modified/extended version of RankingMetrics that produces the identical
> numbers to Kaggle and Rival's results, and I'm wondering if it is something
> appropriate to be added back to MLlib.
>
> Jong Wook
