Posted to user@spark.apache.org by holdingonrobin <ro...@gmail.com> on 2014/06/23 23:21:38 UTC

How to use K-fold validation in spark-1.0?

Hello,

I noticed there were some discussions about adding K-fold validation to
MLlib on Spark, and I believe it should be in Spark 1.0 now. However, there
isn't any documentation or example of how to use it for training. While I am
reading the code to find out, has anyone used it successfully, or does
anyone know where I can find useful information? Thank you very much!





Re: How to use K-fold validation in spark-1.0?

Posted by holdingonrobin <ro...@gmail.com>.
Thanks Evan! I think it works!




Re: How to use K-fold validation in spark-1.0?

Posted by "Evan R. Sparks" <ev...@gmail.com>.
There is a method in org.apache.spark.mllib.util.MLUtils called "kFold"
which will automatically partition your dataset into k (train, test)
splits; at that point you can build k different models and aggregate the
results.

For example (a very rough sketch, assuming I want to do 10-fold
cross-validation of a binary classification model on a file with 1000
features):

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Load LIBSVM-formatted data with binary labels and 1000 features.
val dat = MLUtils.loadLibSVMFile(sc, "path/to/data", false, 1000)

// Produce 10 (train, test) splits, using 42 as the random seed.
val cvdat = MLUtils.kFold(dat, 10, 42)

// Train one model per fold and compute its error on the held-out split.
val modelErrors = cvdat.map { case (train, test) =>
  val model = LogisticRegressionWithSGD.train(train, 100, 0.1, 1.0)
  val error = computeError(model, test)
  (model, error)
}

// Average error across the 10 folds:
val avgError = modelErrors.map(_._2).reduce(_ + _) / modelErrors.length

Here, I'm assuming you've got some "computeError" function defined. Note
that many of these APIs are marked "experimental" and thus might change in
a future Spark release.
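
For completeness, here is a minimal sketch of one way such a computeError
helper could look. It is hypothetical (not part of MLlib) and assumes
binary 0/1 labels and a ClassificationModel such as the one returned by
LogisticRegressionWithSGD.train:

import org.apache.spark.mllib.classification.ClassificationModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: fraction of test points whose predicted label
// differs from the true one (0/1 misclassification rate).
def computeError(model: ClassificationModel, test: RDD[LabeledPoint]): Double = {
  val wrong = test.filter(p => model.predict(p.features) != p.label).count
  wrong.toDouble / test.count
}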


On Tue, Jun 24, 2014 at 6:44 AM, Eustache DIEMERT <eu...@diemert.fr>
wrote:

> I'm interested in this topic too :)
>
> Are the MLlib core devs on this list?
>
> E/
>
>
> 2014-06-24 14:19 GMT+02:00 holdingonrobin <ro...@gmail.com>:
>
>> Does anyone know anything about it? Or should I move this topic to an
>> MLlib-specific mailing list? Any information is appreciated! Thanks!
>>
>>
>>
>>
>
>

Re: How to use K-fold validation in spark-1.0?

Posted by Eustache DIEMERT <eu...@diemert.fr>.
I'm interested in this topic too :)

Are the MLlib core devs on this list?

E/


2014-06-24 14:19 GMT+02:00 holdingonrobin <ro...@gmail.com>:

> Does anyone know anything about it? Or should I move this topic to an
> MLlib-specific mailing list? Any information is appreciated! Thanks!
>
>
>
>

Re: How to use K-fold validation in spark-1.0?

Posted by holdingonrobin <ro...@gmail.com>.
Does anyone know anything about it? Or should I move this topic to an
MLlib-specific mailing list? Any information is appreciated! Thanks!


