Posted to reviews@spark.apache.org by hhbyyh <gi...@git.apache.org> on 2017/07/25 22:00:20 UTC

[GitHub] spark pull request #18733: [SPARK-21535][ML]Reduce memory requirement for Cr...

GitHub user hhbyyh opened a pull request:

    https://github.com/apache/spark/pull/18733

    [SPARK-21535][ML]Reduce memory requirement for CrossValidator and TrainValidationSplit

    ## What changes were proposed in this pull request?
    
    CrossValidator and TrainValidationSplit both use
    `models = est.fit(trainingDataset, epm)`
    to fit the models, where `epm` is an `Array[ParamMap]`.
    Even though the training process is sequential, the current implementation holds all of the trained models in driver memory at the same time. This is unnecessary and often leads to out-of-memory errors for both CrossValidator and TrainValidationSplit. My proposal is to fit one model per ParamMap at a time, so that each model becomes eligible for garbage collection once it has been evaluated, avoiding the unnecessary OOM exceptions.
    
    E.g., with a grid of 12 parameter combinations, the old implementation holds all 12 trained models in driver memory at the same time, while the new implementation holds only 1 trained model at a time; the previous model can be reclaimed by GC.
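    
    A minimal sketch of the proposed pattern (a hypothetical standalone helper for illustration, not the actual CrossValidator code; the names mirror the existing implementation):
    
        import org.apache.spark.ml.{Estimator, Model}
        import org.apache.spark.ml.evaluation.Evaluator
        import org.apache.spark.ml.param.ParamMap
        import org.apache.spark.sql.Dataset
    
        // Fit and evaluate one ParamMap at a time: each trained model goes out of
        // scope at the end of its iteration, so it can be reclaimed by GC before
        // the next fit starts.
        def fitAndEvaluateOneAtATime(
            est: Estimator[_],
            eval: Evaluator,
            epm: Array[ParamMap],
            trainingDataset: Dataset[_],
            validationDataset: Dataset[_]): Array[Double] = {
          epm.map { paramMap =>
            val model = est.fit(trainingDataset, paramMap).asInstanceOf[Model[_]]
            eval.evaluate(model.transform(validationDataset, paramMap))
          }
        }
    
    With this shape, driver memory peaks at one model (plus the collected metrics) instead of all models per fold.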
    
    ## How was this patch tested?
    
    Existing unit tests, since there is no change to the logic.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hhbyyh/spark singleModel

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18733.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18733
    
----
commit a7667e72d78f679b9693e22742e8a624b6348fd2
Author: Yuhao Yang <yu...@intel.com>
Date:   2017-07-25T21:41:17Z

    memory optimization

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18733: [SPARK-21535][ML]Reduce memory requirement for Cr...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18733#discussion_r131121125
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
    @@ -112,16 +112,16 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
           val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
           // multi-model training
           logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    -      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    -      trainingDataset.unpersist()
           var i = 0
           while (i < numModels) {
    +        val model = est.fit(trainingDataset, epm(i)).asInstanceOf[Model[_]]
             // TODO: duplicate evaluator to take extra params from input
    -        val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
    +        val metric = eval.evaluate(model.transform(validationDataset, epm(i)))
             logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
             metrics(i) += metric
             i += 1
           }
    +      trainingDataset.unpersist()
    --- End diff --
    
    One consideration here is that we're unpersisting the training data only after all models (for a fold) are evaluated. This means the full dataset (train and validation) is in cluster memory throughout, whereas previously only one dataset would be in cluster memory at a time. It's possible the impact of this on cluster resources may be greater than the saving on the driver from storing 1 model instead of `numModels` models temporarily per fold?
    
    It obviously depends on a lot of factors (dataset size, cluster resources, driver memory, model size, etc.).
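    
    A hypothetical variant (not part of this PR; just a sketch of one way to narrow that window, reusing the names from the diff above): release the cached training data as soon as the final fit has returned, so only validationDataset stays cached during the last evaluation.
    
        var i = 0
        while (i < numModels) {
          val model = est.fit(trainingDataset, epm(i)).asInstanceOf[Model[_]]
          // After the last fit the training split is no longer needed; evaluation
          // only touches validationDataset.
          if (i == numModels - 1) trainingDataset.unpersist()
          val metric = eval.evaluate(model.transform(validationDataset, epm(i)))
          metrics(i) += metric
          i += 1
        }
    
    This only helps during the final iteration, though; for the earlier fits both datasets can still be materialized at once, so it does not fully remove the trade-off.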




[GitHub] spark pull request #18733: [SPARK-21535][ML]Reduce memory requirement for Cr...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh closed the pull request at:

    https://github.com/apache/spark/pull/18733




[GitHub] spark issue #18733: [SPARK-21535][ML]Reduce memory requirement for CrossVali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18733
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79944/
    Test PASSed.




[GitHub] spark issue #18733: [SPARK-21535][ML]Reduce memory requirement for CrossVali...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/18733
  
    Features should be merged when they are reasonable and ready, rather than held up waiting on uncertain changes, especially when there are no conflicts. Spark development is already way too slow.





[GitHub] spark issue #18733: [SPARK-21535][ML]Reduce memory requirement for CrossVali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18733
  
    **[Test build #79944 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79944/testReport)** for PR 18733 at commit [`a7667e7`](https://github.com/apache/spark/commit/a7667e72d78f679b9693e22742e8a624b6348fd2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18733: [SPARK-21535][ML]Reduce memory requirement for CrossVali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18733
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18733: [SPARK-21535][ML]Reduce memory requirement for CrossVali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18733
  
    **[Test build #79944 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79944/testReport)** for PR 18733 at commit [`a7667e7`](https://github.com/apache/spark/commit/a7667e72d78f679b9693e22742e8a624b6348fd2).




[GitHub] spark issue #18733: [SPARK-21535][ML]Reduce memory requirement for CrossVali...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/18733
  
    Nothing in this change depends on #16774.
    
    The basic idea is that we should release the driver memory as soon as a trained model has been evaluated. I don't see any conflict.





[GitHub] spark pull request #18733: [SPARK-21535][ML]Reduce memory requirement for Cr...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18733#discussion_r131268294
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
    @@ -112,16 +112,16 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
           val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
           // multi-model training
           logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    -      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    -      trainingDataset.unpersist()
           var i = 0
           while (i < numModels) {
    +        val model = est.fit(trainingDataset, epm(i)).asInstanceOf[Model[_]]
             // TODO: duplicate evaluator to take extra params from input
    -        val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
    +        val metric = eval.evaluate(model.transform(validationDataset, epm(i)))
             logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
             metrics(i) += metric
             i += 1
           }
    +      trainingDataset.unpersist()
    --- End diff --
    
    Ah, you're right. I was under the wrong impression that validationDataset is always in memory.
    
    Even though validationDataset is only `1/kfold` the size of the full dataset, and it's only used in `transform` but not in `fit`, I still cannot prove that the new implementation is better in all circumstances.
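    
    For a rough sense of the relative sizes (purely illustrative arithmetic, assuming 3-fold cross-validation over n rows):
    
        // Illustrative only: relative sizes of the cached splits for k = 3 folds.
        val n = 9000000L                 // total rows in the input dataset
        val k = 3
        val validationRows = n / k       // ~3,000,000 rows, used only by transform/evaluate
        val trainingRows   = n - n / k   // ~6,000,000 rows, used by fit
        // Old code: the training split is unpersisted before evaluation, so at most
        // one split is materialized in cluster memory at a time.
        // New code: both splits stay cached until the loop finishes (~n rows in total),
        // while the driver holds at most one trained model.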
    
    I'll close the PR unless there's a better way to resolve the concern. Thanks.

