Posted to dev@spark.apache.org by Peter Rudenko <pe...@gmail.com> on 2015/02/11 20:13:03 UTC

[ml] Lost persistence for fold in crossvalidation.

Hi, I have a problem. I'm using Spark 1.2 with Pipeline + GridSearch +
LogisticRegression. I've reimplemented the LogisticRegression.fit method and
commented out instances.unpersist():

|override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
  println(s"Fitting dataset ${dataset.take(1000).toSeq.hashCode()} with ParamMap $paramMap.")
  transformSchema(dataset.schema, paramMap, logging = true)
  import dataset.sqlContext._
  val map = this.paramMap ++ paramMap
  val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
    .map {
      case Row(label: Double, features: Vector) =>
        LabeledPoint(label, features)
    }

  if (instances.getStorageLevel == StorageLevel.NONE) {
    println("Instances not persisted")
    instances.persist(StorageLevel.MEMORY_AND_DISK)
  }

  val lr = (new LogisticRegressionWithLBFGS)
    .setValidateData(false)
    .setIntercept(true)
  lr.optimizer
    .setRegParam(map(regParam))
    .setNumIterations(map(maxIter))
  val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
  // instances.unpersist()
  // copy model params
  Params.inheritValues(map, this, lrm)
  lrm
}
|

CrossValidator feeds the same SchemaRDD for each parameter (same hash
code), but somewhere the cache is being flushed. There is enough memory.
Here's the output:

|Fitting dataset 2051470010 with ParamMap {
     DRLogisticRegression-f35ae4d3-regParam: 0.1
}.
Instances not persisted
Fitting dataset 2051470010 with ParamMap {
     DRLogisticRegression-f35ae4d3-regParam: 0.01
}.
Instances not persisted
Fitting dataset 2051470010 with ParamMap {
     DRLogisticRegression-f35ae4d3-regParam: 0.001
}.
Instances not persisted
Fitting dataset 802615223 with ParamMap {
     DRLogisticRegression-f35ae4d3-regParam: 0.1
}.
Instances not persisted
Fitting dataset 802615223 with ParamMap {
     DRLogisticRegression-f35ae4d3-regParam: 0.01
}.
Instances not persisted
|

I have 3 parameter values in the grid search and 3 folds for cross-validation:

|
val paramGrid = new ParamGridBuilder()
  .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
  .build()

crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)
|

I assumed that the data would be read and cached 3 times, i.e.
(1 to numFolds).combinations(2), independent of the number of parameters.
But the data is being read and cached 9 times.
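
To make the expected count concrete, here is a minimal plain-Scala sketch
(no Spark involved; the object and value names are hypothetical). It assumes
each training set is the union of the folds that are not held out, which is
what (1 to numFolds).combinations(2) describes for 3 folds.

```scala
object FoldCombinations {
  val numFolds = 3

  // Each training set consists of the folds that are NOT held out for
  // validation, so with 3 folds there is one training set per pair of folds.
  val trainingSets: List[Seq[Int]] =
    (1 to numFolds).combinations(numFolds - 1).map(_.toSeq).toList

  def main(args: Array[String]): Unit = {
    trainingSets.foreach(ts => println(s"train on folds ${ts.mkString(", ")}"))
    println(s"distinct training sets: ${trainingSets.size}")
  }
}
```

Under that assumption there are only 3 distinct training sets, which is why
3 cache operations seemed like the natural lower bound.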

Thanks,
Peter Rudenko

Re: [ml] Lost persistence for fold in crossvalidation.

Posted by Joseph Bradley <jo...@databricks.com>.

Now in JIRA form: https://issues.apache.org/jira/browse/SPARK-5844

On Tue, Feb 17, 2015 at 3:12 PM, Xiangrui Meng <me...@gmail.com> wrote:

> There are three different regParams defined in the grid and there are
> three folds. For simplicity, we didn't split the dataset into three parts
> and reuse them; instead we do the split for each fold. So we need to cache
> 3*3 times. Note that the pipeline API is not yet optimized for
> performance. It would be nice to optimize its performance in 1.4.
> -Xiangrui

Re: [ml] Lost persistence for fold in crossvalidation.

Posted by Xiangrui Meng <me...@gmail.com>.

There are three different regParams defined in the grid and there are
three folds. For simplicity, we didn't split the dataset into three parts
and reuse them; instead we do the split for each fold. So we need to cache
3*3 times. Note that the pipeline API is not yet optimized for
performance. It would be nice to optimize its performance in 1.4.
-Xiangrui
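
The 3*3 count can be illustrated with a small plain-Scala simulation (no
Spark; CachingSketch, cachePerFit and cachePerFold are hypothetical names,
not part of the pipeline API). It only counts how often a training set would
be materialized under two strategies, and is a sketch of the arithmetic
rather than of how CrossValidator is actually implemented.

```scala
import scala.collection.mutable

object CachingSketch {
  val numFolds = 3
  val regParams = Seq(0.1, 0.01, 0.001)

  // Strategy A: every fit call derives and caches its own copy of the
  // training data, so the count grows with numFolds * regParams.size.
  def cachePerFit(): Int = {
    var cached = 0
    for (_ <- 0 until numFolds; _ <- regParams) {
      cached += 1 // a fresh, uncached dataset is persisted inside each fit
    }
    cached
  }

  // Strategy B: cache the training data once per fold and reuse that
  // cached copy for every regParam tried on the fold.
  def cachePerFold(): Int = {
    val cache = mutable.Map.empty[Int, String]
    var cached = 0
    for (fold <- 0 until numFolds; _ <- regParams) {
      // The placeholder string stands in for the cached training data.
      cache.getOrElseUpdate(fold, { cached += 1; s"training-data-for-fold-$fold" })
    }
    cached
  }

  def main(args: Array[String]): Unit = {
    println(s"cache per fit:  ${cachePerFit()}")
    println(s"cache per fold: ${cachePerFold()}")
  }
}
```

With 3 folds and 3 regParams, strategy A caches 9 times and strategy B
caches 3 times, matching the numbers discussed in this thread.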

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org