Posted to user@spark.apache.org by SK <sk...@gmail.com> on 2014/07/29 03:07:38 UTC

evaluating classification accuracy

Hi,

In order to evaluate the ML classification accuracy, I am zipping up the
prediction and test labels as follows and then comparing the pairs in
predictionAndLabel:

val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))


However, I am finding that predictionAndLabel.count() has fewer elements
than test.count().  For example, my test vector has 43 elements, but
predictionAndLabel has only 38 pairs. I have tried other samples and always
get fewer elements after zipping. 
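[For reference: with plain Scala collections, zip silently truncates to the shorter input, which would produce exactly this kind of shortfall; Spark's RDD.zip is instead documented to require both RDDs to have the same number of partitions and the same number of elements per partition. A minimal collections-only sketch of the truncation behavior:]

```scala
object ZipDemo {
  // zip pairs corresponding elements and stops at the end of the shorter input
  def zippedCount(a: Seq[Double], b: Seq[Double]): Int = a.zip(b).length

  def main(args: Array[String]): Unit = {
    val predictions = Seq(1.0, 0.0, 1.0, 1.0, 0.0) // 5 predictions
    val labels      = Seq(1.0, 0.0, 0.0)           // only 3 labels
    println(zippedCount(predictions, labels))      // prints 3
  }
}
```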

Does zipping the two vectors cause any compression? Or is this because of
the distributed nature of the algorithm? (I am running it in local mode on a
single machine.) To get the correct accuracy, I need the above comparison to
be done on the entire test data by a single node (my data is quite small).
How can I ensure that?

thanks 






--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/evaluating-classification-accuracy-tp10822.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: evaluating classification accuracy

Posted by SK <sk...@gmail.com>.
I am using 1.0.1 and running locally (I am not providing any master URL),
but zip() still does not produce the correct count, as mentioned above, so I
am not sure the issue has been fixed in 1.0.1. However, instead of using
zip, I am now using the code that Sean mentioned, and I am getting the
correct count. So the issue is resolved.

thanks.




Re: evaluating classification accuracy

Posted by Sean Owen <so...@cloudera.com>.
Yes. In addition, I think Xiangrui updated the examples to use a
different form that does not rely on zip:

test.map(v => (model.predict(v.features), v.label))

It avoids evaluating test twice, and avoids the zip. Although I suppose
you have to bear in mind that it now calls predict() on each element, not
on the whole RDD.
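[A minimal sketch of this pattern and of computing accuracy from the resulting pairs, using plain Scala collections in place of RDDs; DummyModel is a hypothetical stand-in for an MLlib classifier, and the map/count calls mirror the RDD API used in the thread:]

```scala
// A point with features and a true label, mirroring MLlib's LabeledPoint
case class Point(features: Seq[Double], label: Double)

// Toy stand-in for a trained classifier: predicts 1.0 when the
// first feature is positive, 0.0 otherwise
class DummyModel {
  def predict(features: Seq[Double]): Double =
    if (features.head > 0) 1.0 else 0.0
}

object AccuracyDemo {
  def accuracy(model: DummyModel, test: Seq[Point]): Double = {
    // One pass over test: pair each prediction with its true label
    val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
    // Fraction of pairs where prediction matches the label
    predictionAndLabel.count { case (pred, label) => pred == label }.toDouble / test.size
  }

  def main(args: Array[String]): Unit = {
    val test = Seq(
      Point(Seq(1.0), 1.0),  // predicted 1.0, correct
      Point(Seq(-1.0), 0.0), // predicted 0.0, correct
      Point(Seq(2.0), 0.0)   // predicted 1.0, wrong
    )
    println(accuracy(new DummyModel, test)) // 2 of 3 correct
  }
}
```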

On Tue, Jul 29, 2014 at 5:26 AM, Xiangrui Meng <me...@gmail.com> wrote:
> Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and
> master. If you don't want to switch to 1.0.1 or master, try to cache
> and count test first. -Xiangrui

Re: evaluating classification accuracy

Posted by Xiangrui Meng <me...@gmail.com>.
Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and
master. If you don't want to switch to 1.0.1 or master, try to cache
and count test first. -Xiangrui
