Posted to dev@spark.apache.org by Sean Owen <so...@cloudera.com> on 2014/11/02 18:34:49 UTC

OOM when making bins in BinaryClassificationMetrics ?

This might be a question for Xiangrui. Recently I was using
BinaryClassificationMetrics to build an AUC curve for a classifier
over a reasonably large number of points (~12M). The scores were all
probabilities, so tended to be almost entirely unique.

The computation does some operations by key, and this ran out of
memory. You can work around it by giving the job more than the default
amount of memory, but in this case it seemed pointless to build an AUC
curve at such fine-grained resolution.

I ended up just binning the scores so there were ~1000 unique values
and then it was fine.
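
The rounding workaround described above can be sketched roughly as
follows; the `coarsenedAuc` helper and its names are illustrative, not
anything in MLlib, and it assumes an RDD of (probability, label) pairs:

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

// Illustrative sketch: round each probability to the nearest 1/1000,
// collapsing ~12M nearly-unique scores into at most ~1001 distinct
// keys before the by-key operations inside the metrics computation.
def coarsenedAuc(scoreAndLabels: RDD[(Double, Double)]): Double = {
  val bins = 1000.0
  val binned = scoreAndLabels.map { case (score, label) =>
    (math.round(score * bins) / bins, label)
  }
  new BinaryClassificationMetrics(binned).areaUnderROC()
}
```

This only works because the scores here are probabilities in [0, 1];
rounding to fixed decimals assumes the values span a known range.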

Does that sound generally useful as some kind of parameter? Or am I
missing a trick here?

Sean

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: OOM when making bins in BinaryClassificationMetrics ?

Posted by Sean Owen <so...@cloudera.com>.
Agreed -- plain rounding only makes sense if the values are roughly
evenly distributed; in my case they were in [0, 1]. I will put it on my
to-do list to look at, yes. Thanks for the confirmation.



Re: OOM when making bins in BinaryClassificationMetrics ?

Posted by Xiangrui Meng <me...@gmail.com>.
Yes, if there are many distinct values, we need binning to compute the
AUC curve. Usually the scores are not evenly distributed, so we cannot
simply truncate the digits. Estimating the quantiles for binning is
necessary, similar to what RangePartitioner does:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104
Limiting the number of bins is definitely useful. Do you have time
to work on it? -Xiangrui
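
A sampling-based quantile binning in the spirit of RangePartitioner
might look something like this sketch; every name here is hypothetical,
not an existing MLlib API, and the sampling parameters are arbitrary:

```scala
import org.apache.spark.rdd.RDD

// Illustrative sketch: sample the scores, sort the sample, and take
// evenly spaced order statistics as approximate quantile boundaries,
// rather than sorting all ~12M scores exactly.
def quantileBoundaries(scores: RDD[Double], numBins: Int): Array[Double] = {
  val sampleSize = math.min(numBins * 20, 1 << 20)
  val fraction = math.min(1.0, sampleSize.toDouble / scores.count())
  val sample = scores.sample(withReplacement = false, fraction).collect().sorted
  // numBins - 1 approximately evenly spaced sample order statistics.
  (1 until numBins).map { i =>
    sample((i.toLong * (sample.length - 1) / numBins).toInt)
  }.distinct.toArray
}

// Map a score to its bin index via binary search on the boundaries;
// scores in the same bin then share a key in the by-key operations.
def binOf(boundaries: Array[Double], score: Double): Int = {
  val idx = java.util.Arrays.binarySearch(boundaries, score)
  if (idx >= 0) idx else -idx - 1
}
```

Unlike fixed-width rounding, quantile boundaries adapt to skewed score
distributions, so every bin ends up with roughly the same number of
points.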

