Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2017/08/03 11:55:06 UTC

[GitHub] spark pull request #18733: [SPARK-21535][ML]Reduce memory requirement for Cr...

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18733#discussion_r131121125
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
    @@ -112,16 +112,16 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
           val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
           // multi-model training
           logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    -      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    -      trainingDataset.unpersist()
           var i = 0
           while (i < numModels) {
    +        val model = est.fit(trainingDataset, epm(i)).asInstanceOf[Model[_]]
             // TODO: duplicate evaluator to take extra params from input
    -        val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
    +        val metric = eval.evaluate(model.transform(validationDataset, epm(i)))
             logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
             metrics(i) += metric
             i += 1
           }
    +      trainingDataset.unpersist()
    --- End diff --
    
    One consideration here is that we're unpersisting the training data only after all models (for a fold) are evaluated. This means the full dataset (train and validation) is in cluster memory throughout the loop, whereas previously only one dataset would be materialized at a time. It's possible the impact of this on cluster resources is greater than the saving on the driver from holding `1` model instead of `numModels` models temporarily per fold?
    
    It obviously depends on many factors (dataset size, cluster resources, driver memory, model size, etc.).
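    
    To make the trade-off concrete, here is a toy replay of the two orderings (illustrative plain Scala, not Spark code; names like `peaks`, `previous`, and `perModel` are invented for this sketch). "cache"/"uncache" stand in for when a dataset is actually materialized in cluster memory (since `.cache()` is lazy, the validation data was previously only materialized after the training data was unpersisted), and "fit"/"drop" stand in for models held on the driver:
    
    ```scala
    object CachingTradeoff {
      // Replay a sequence of events and return the peak number of
      // concurrently materialized datasets and driver-held models.
      def peaks(events: Seq[String]): (Int, Int) = {
        var cached, models, peakCached, peakModels = 0
        events.foreach {
          case "cache"   => cached += 1; peakCached = peakCached max cached
          case "uncache" => cached -= 1
          case "fit"     => models += 1; peakModels = peakModels max models
          case "drop"    => models -= 1
          case _         => ()
        }
        (peakCached, peakModels)
      }
    
      val numModels = 3
    
      // Previous ordering: materialize training data, fit all models,
      // unpersist training, materialize validation, evaluate, release models.
      val previous: Seq[String] =
        Seq("cache") ++ Seq.fill(numModels)("fit") ++ Seq("uncache", "cache") ++
          Seq.fill(numModels)("drop") ++ Seq("uncache")
    
      // This PR's ordering: both datasets stay materialized across the loop,
      // but only one model is held at a time.
      val perModel: Seq[String] =
        Seq("cache", "cache") ++
          (1 to numModels).flatMap(_ => Seq("fit", "drop")) ++
          Seq("uncache", "uncache")
    }
    ```
    
    Replaying `previous` gives a peak of 1 dataset but `numModels` models, while `perModel` gives 2 datasets but 1 model, which is exactly the trade-off in question.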


