Posted to dev@spark.apache.org by Peter Rudenko <pe...@gmail.com> on 2015/02/11 20:13:03 UTC
[ml] Lost persistence for fold in crossvalidation.
Hi, I have a problem. I'm using Spark 1.2 with Pipeline + GridSearch +
LogisticRegression. I've reimplemented the LogisticRegression.fit method and
commented out instances.unpersist():
|override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
  println(s"Fitting dataset ${dataset.take(1000).toSeq.hashCode()} with ParamMap $paramMap.")
  transformSchema(dataset.schema, paramMap, logging = true)
  import dataset.sqlContext._
  val map = this.paramMap ++ paramMap
  val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
    .map {
      case Row(label: Double, features: Vector) =>
        LabeledPoint(label, features)
    }
  if (instances.getStorageLevel == StorageLevel.NONE) {
    println("Instances not persisted")
    instances.persist(StorageLevel.MEMORY_AND_DISK)
  }
  val lr = (new LogisticRegressionWithLBFGS)
    .setValidateData(false)
    .setIntercept(true)
  lr.optimizer
    .setRegParam(map(regParam))
    .setNumIterations(map(maxIter))
  val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
  //instances.unpersist()
  // copy model params
  Params.inheritValues(map, this, lrm)
  lrm
}
|
CrossValidator feeds the same SchemaRDD for each parameter (same hash
code), but somewhere the cache is being flushed. There is enough memory.
Here's the output:
|Fitting dataset 2051470010 with ParamMap {
DRLogisticRegression-f35ae4d3-regParam: 0.1
}.
Instances not persisted
Fitting dataset 2051470010 with ParamMap {
DRLogisticRegression-f35ae4d3-regParam: 0.01
}.
Instances not persisted
Fitting dataset 2051470010 with ParamMap {
DRLogisticRegression-f35ae4d3-regParam: 0.001
}.
Instances not persisted
Fitting dataset 802615223 with ParamMap {
DRLogisticRegression-f35ae4d3-regParam: 0.1
}.
Instances not persisted
Fitting dataset 802615223 with ParamMap {
DRLogisticRegression-f35ae4d3-regParam: 0.01
}.
Instances not persisted
|
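A minimal sketch of one likely reason the message prints on every call, using a toy stand-in rather than Spark itself. In Spark, a transformation such as map returns a new RDD whose storage level is NONE, and persist applies only to the RDD it is called on. Since fit derives instances from dataset with select and map on each call, the storage-level check would see a fresh, unpersisted RDD every time, even when the parent dataset is identical. ToyRdd and fitOnce below are hypothetical names invented for this illustration.

```scala
object ToyPersistence {
  // Hypothetical, much simplified stand-in for an RDD: it only tracks a persisted flag.
  final class ToyRdd[A](val data: Seq[A]) {
    private var persisted = false
    def isPersisted: Boolean = persisted
    def persist(): this.type = { persisted = true; this }
    // Like a Spark transformation, map returns a brand-new, unpersisted object.
    def map[B](f: A => B): ToyRdd[B] = new ToyRdd(data.map(f))
  }

  // Mimics the check in fit(): derive instances, then persist them if needed.
  // Returns true when the derived dataset was found unpersisted.
  def fitOnce(dataset: ToyRdd[Int]): Boolean = {
    val instances = dataset.map(_.toDouble)
    val wasUnpersisted = !instances.isPersisted
    if (wasUnpersisted) instances.persist()
    wasUnpersisted
  }
}
```

Calling ToyPersistence.fitOnce three times on the same persisted parent reports an unpersisted derived dataset each time, which matches the shape of the output above.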
I have 3 regParam values in GridSearch and 3 folds for CrossValidation:
|
val paramGrid = new ParamGridBuilder()
  .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
  .build()
crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)
|
I assume that the data should be read and cached 3 times, i.e.
(1 to numFolds).combinations(2), independent of the number of parameters.
But I see the data being read and cached 9 times.
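The expected count can be checked in plain Scala. This is only a sketch, assuming the usual k-fold scheme in which each training set is made of k - 1 folds; FoldCounting, trainingSets and pairs are hypothetical names for this illustration. For numFolds = 3 the quoted expression and the k - 1 formulation give the same 3 training sets, though they differ for larger k.

```scala
object FoldCounting {
  // All ways to pick (numFolds - 1) of the folds as a training set.
  def trainingSets(numFolds: Int): List[List[Int]] =
    (1 to numFolds).toList.combinations(numFolds - 1).toList

  // The expression quoted in the message: all pairs of folds.
  def pairs(numFolds: Int): List[List[Int]] =
    (1 to numFolds).toList.combinations(2).toList
}
```

With 3 folds, FoldCounting.pairs(3) and FoldCounting.trainingSets(3) both contain 3 entries; with 5 folds there are 10 pairs but only 5 training sets.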
Thanks,
Peter Rudenko
Re: [ml] Lost persistence for fold in crossvalidation.
Posted by Joseph Bradley <jo...@databricks.com>.
Now in JIRA form: https://issues.apache.org/jira/browse/SPARK-5844
On Tue, Feb 17, 2015 at 3:12 PM, Xiangrui Meng <me...@gmail.com> wrote:
> There are three different regParams defined in the grid and there are
> three folds. For simplicity, we don't split the dataset into three parts
> once and reuse them; instead we do the split for each fold. So we need to
> cache 3*3 times. Note that the pipeline API is not yet optimized for
> performance. It would be nice to optimize its performance in 1.4.
> -Xiangrui
Re: [ml] Lost persistence for fold in crossvalidation.
Posted by Xiangrui Meng <me...@gmail.com>.
There are three different regParams defined in the grid and there are
three folds. For simplicity, we don't split the dataset into three parts
once and reuse them; instead we do the split for each fold. So we need to
cache 3*3 times. Note that the pipeline API is not yet optimized for
performance. It would be nice to optimize its performance in 1.4.
-Xiangrui
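
The 3*3 count above can be reproduced with a small plain-Scala simulation. This is a sketch under stated assumptions, not the CrossValidator implementation: CacheCounting, perFit and perFold are hypothetical names, and a "cache event" stands for converting a fold's training data and persisting it. perFit models caching inside every fit call, as in the thread; perFold models a possible optimization that caches once per fold and reuses the data for every parameter value.

```scala
object CacheCounting {
  // Caching happens inside every fit: one cache event per (fold, parameter) pair.
  // Returns (cacheEvents, fits).
  def perFit(numFolds: Int, regParams: Seq[Double]): (Int, Int) = {
    var cacheEvents = 0
    var fits = 0
    for (_ <- 0 until numFolds; _ <- regParams) {
      cacheEvents += 1 // each fit builds and persists its own instances
      fits += 1
    }
    (cacheEvents, fits)
  }

  // Hypothetical optimization: cache once per fold, reuse across parameters.
  // Returns (cacheEvents, fits).
  def perFold(numFolds: Int, regParams: Seq[Double]): (Int, Int) = {
    var cacheEvents = 0
    var fits = 0
    for (_ <- 0 until numFolds) {
      cacheEvents += 1 // convert and persist the fold's training data once
      for (_ <- regParams) fits += 1 // every parameter value reuses it
    }
    (cacheEvents, fits)
  }
}
```

For 3 folds and the grid Seq(0.1, 0.01, 0.001), perFit yields 9 cache events for 9 fits, while perFold yields 3 cache events for the same 9 fits.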
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org