You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by holdenk <gi...@git.apache.org> on 2016/03/23 20:17:37 UTC

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

GitHub user holdenk opened a pull request:

    https://github.com/apache/spark/pull/11919

    [SPARK-6717][ML] Clear shuffle files after checkpointing in ALS

    ## What changes were proposed in this pull request?
    
    When ALS is run with a checkpoint interval, during the checkpoint materialize the current state and cleanup the previous shuffles (non-blocking).
    
    ## How was this patch tested?
    
    Existing ALS unit tests, new ALS checkpoint cleanup unit tests added (blocking for testing).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-6717-clear-shuffle-files-after-checkpointing-in-ALS

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11919.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11919
    
----
commit 0931403387041d94bb95a860b0d7bb7526ea08b5
Author: Holden Karau <ho...@us.ibm.com>
Date:   2016-03-23T02:07:21Z

    Cleanup comment, isn't always async

commit aeebc387ce8d5fac33b606d6fefa4d027a324b83
Author: Holden Karau <ho...@us.ibm.com>
Date:   2016-03-23T02:07:48Z

    Switch to checkpoint & cleanup technique inside of ALS

commit 341daac848b1327babf760a418499afb67987690
Author: Holden Karau <ho...@us.ibm.com>
Date:   2016-03-23T04:16:33Z

    Mention we materialize in the checkPoint

commit 542ea9718459de213cbffd75249aaf94b8c86eb1
Author: Holden Karau <ho...@us.ibm.com>
Date:   2016-03-23T04:18:19Z

    Don't unpersist since we already do automatically.

commit 2f1939904259ac625f839c5cbca90790ead37ef5
Author: Holden Karau <ho...@us.ibm.com>
Date:   2016-03-23T18:43:44Z

    Some style fixes

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11919


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61324736
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1310,30 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    if (sc.cleaner.isEmpty) {
    +      return
    +    }
    +    val cleaner = sc.cleaner.get
    +    /**
    +     * Clean the shuffles & all of its parents.
    +     */
    +    def cleanEagerly(dep: Dependency[_]): Unit = {
    --- End diff --
    
    So it will be at most checkpoint interval deep, so it shouldn't be super long.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216097488
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57493/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61709679
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    --- End diff --
    
    Yes, its used so that I can know what the output shuffle files look like.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-204500473
  
    **[Test build #54709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54709/consoleFull)** for PR 11919 at commit [`2f19399`](https://github.com/apache/spark/commit/2f1939904259ac625f839c5cbca90790ead37ef5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213137319
  
    **[Test build #56584 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56584/consoleFull)** for PR 11919 at commit [`58be94b`](https://github.com/apache/spark/commit/58be94b823d663af35fc2db46be5b23ba2c46cea).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class SparkSqlParser(conf: SQLConf) extends AbstractSqlParser`
      * `class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder `
      * `   *   [STORED AS file_format | STORED BY storage_handler_class [WITH SERDEPROPERTIES (...)]]`
      * `case class AddJar(path: String) extends RunnableCommand `
      * `case class AddFile(path: String) extends RunnableCommand `
      * `case class CreateTableAsSelectLogicalPlan(`
      * `case class CreateViewAsSelectLogicalCommand(`
      * `case class HiveSerDe(`
      * `class VariableSubstitution(conf: SQLConf) `
      * `  implicit class SchemaAttribute(f: CatalogColumn) `
      * `class HiveSqlParser(conf: SQLConf) extends AbstractSqlParser `
      * `class HiveSqlAstBuilder(conf: SQLConf) extends SparkSqlAstBuilder(conf) `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708647
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(tempDir.getAbsolutePath)
    +        // Test checkpoint and clean parents
    +        val filesBefore = getAllFiles
    +        val input = sc.parallelize(1 to 1000)
    +        val keyed = input.map(x => (x % 20, 1))
    +        val shuffled = keyed.reduceByKey(_ + _)
    +        val keysOnly = shuffled.map{case (x, _) => x}
    +        val deps = keysOnly.dependencies
    +        keysOnly.checkpoint()
    +        keysOnly.count()
    +        ALS.cleanShuffleDependencies(sc, deps, true)
    +        assert(keysOnly.isCheckpointed)
    +        val resultingFiles = getAllFiles -- filesBefore
    +        assert(resultingFiles === Set())
    +      } finally {
    +        sc.stop()
    +      }
    +    } finally {
    +      Utils.deleteRecursively(localDir)
    +      Utils.deleteRecursively(tempDir)
    +    }
    +  }
    +
    +  test("ALS shuffle cleanup") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(tempDir.getAbsolutePath)
    +        // Generate test data
    +        val (training, _) = ALSSuite.genImplicitTestData(sc, 100, 10, 1, 0.2, 0)
    +        // Implicitly test the cleaning of parents during ALS training
    +        val filesBefore = getAllFiles
    +        val sqlContext = new SQLContext(sc)
    +        import sqlContext.implicits._
    +        val als = new ALS()
    +          .setRank(1)
    +          .setRegParam(1e-5)
    +          .setSeed(0)
    +          .setCheckpointInterval(1)
    +        val model = als.fit(training.toDF())
    +        val resultingFiles = getAllFiles -- filesBefore
    +        // We expect the last shuffles file to be around, but no more
    +        val pattern = "shuffle_(\\d+)_.+\\.data".r
    +        val rddIds = resultingFiles.flatMap(f =>
    --- End diff --
    
    `.flatMap { f =>`. We usually use `(f => ...)` if it fits one line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207164324
  
    **[Test build #55295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55295/consoleFull)** for PR 11919 at commit [`c1493a1`](https://github.com/apache/spark/commit/c1493a167a063128a8badd8b5ab4ff7293c57945).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61318645
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -541,4 +585,82 @@ object ALSSuite {
         "nonnegative" -> true,
         "checkpointInterval" -> 20
       )
    +
    +  // Helper functions to generate test data we share between ALS test suites
    +
    +  /**
    +   * Generates random user/item factors, with i.i.d. values drawn from U(a, b).
    +   * @param size number of users/items
    +   * @param rank number of features
    +   * @param random random number generator
    +   * @param a min value of the support (default: -1)
    +   * @param b max value of the support (default: 1)
    +   * @return a sequence of (ID, factors) pairs
    +   */
    +  private def genFactors(
    --- End diff --
    
    yup


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r60913987
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -514,7 +483,83 @@ class ALSSuite
       }
     }
     
    -object ALSSuite {
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(tempDir.getAbsolutePath)
    +        // Test checkpoint and clean parents
    +        val filesBefore = getAllFiles
    +        val input = sc.parallelize(1 to 1000)
    +        val keyed = input.map(x => (x % 20, 1))
    +        val shuffled = keyed.reduceByKey(_ + _)
    +        val keysOnly = shuffled.map{case (x, _) => x}
    +        val deps = keysOnly.dependencies
    +        keysOnly.checkpoint()
    +        keysOnly.count()
    +        ALS.cleanDependencies(sc, deps, true)
    +        assert(keysOnly.isCheckpointed)
    +        val resultingFiles = getAllFiles -- filesBefore
    +        assert(resultingFiles === Set())
    +      } finally {
    +        sc.stop()
    +      }
    +    } finally {
    +      Utils.deleteRecursively(localDir)
    +      Utils.deleteRecursively(tempDir)
    +    }
    +  }
    +
    +  test("ALS shuffle cleanup") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(tempDir.getAbsolutePath)
    +        // Generate test data
    +        val (training, _) = ALSSuite.genImplicitTestData(sc, 100, 10, 1, 0.2, 0)
    +        // Implicitly test the cleaning of parents during ALS training
    +        val filesBefore = getAllFiles
    +        val sqlContext = new SQLContext(sc)
    +        import sqlContext.implicits._
    +        val als = new ALS()
    +          .setRank(1)
    +          .setRegParam(1e-5)
    +          .setSeed(0)
    +          .setCheckpointInterval(1)
    +        val alpha = als.getAlpha
    --- End diff --
    
    what is this line for?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r60511187
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,7 +656,8 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    +          // itemFactors gets materialized in computeFactors & here.
    +          ALS.checkpointAndCleanParents(itemFactors)
    --- End diff --
    
    @MLnick Actually, that plan doesn't work - checkpointing kills all of the parents information we need to clean up the shuffle files. I could refactor this so that we capture the dependency information needed - but a count() on a cached RDD should be low enough cost I'm not sure it would be worth it. What are your thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207137100
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61323699
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1310,30 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    if (sc.cleaner.isEmpty) {
    +      return
    +    }
    +    val cleaner = sc.cleaner.get
    +    /**
    +     * Clean the shuffles & all of its parents.
    +     */
    +    def cleanEagerly(dep: Dependency[_]): Unit = {
    --- End diff --
    
    Oh right, I see that now. OK. Are we worried the dependency graph will be super deep? probably not right, since chained shuffle dependencies woudl be rare?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-206067767
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r60726431
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,7 +656,8 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    +          // itemFactors gets materialized in computeFactors & here.
    +          ALS.checkpointAndCleanParents(itemFactors)
    --- End diff --
    
    @mengxr just want to get your opinion on this - I think an extra `count` in the implicit case is not too bad.
    
    Having said that, I took another look and it would only be a few extra lines to capture the shuffle deps in L660 & L676 (similar to how `previousCheckpointFile` works), and clean them just before `deletePreviousCheckpointFile()` is called.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-212696347
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207888801
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r60969566
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,13 +656,15 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             previousItemFactors.unpersist()
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
    +        val deps = itemFactors.dependencies
    --- End diff --
    
    Ah sorry you're correct. Keep it as is.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61316237
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1310,30 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    if (sc.cleaner.isEmpty) {
    +      return
    +    }
    +    val cleaner = sc.cleaner.get
    --- End diff --
    
    Are these 4 lines easier as `sc.cleaner.foreach { cleaner => `?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216117214
  
    I'm making a pass.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r58971979
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,7 +656,8 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    +          // itemFactors gets materialized in computeFactors & here.
    +          ALS.checkpointAndCleanParents(itemFactors)
    --- End diff --
    
    So digging into this I noticed checkpointing behaves a bit oddly when the input RDD is already materialized
    
    > scala> val rdd = sc.parallelize(1.to(10)).map(_.toString).cache()
    > rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at map at <console>:25
    > 
    > scala> rdd.checkpoint()
    > 
    > scala> rdd.count()
    > res2: Long = 10                                                                 
    > 
    > scala> rdd.isCheckpointed
    > res3: Boolean = true
    > 
    
    However:
    
    > scala> val rdd = sc.parallelize(1.to(10)).map(_.toString).cache()
    > rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at map at <console>:25
    > 
    > scala> rdd.count()
    > res4: Long = 10
    > 
    > scala> rdd.checkpoint()
    > 
    > scala> rdd.isCheckpointed
    > res6: Boolean = false
    > 
    > scala> rdd.count()
    > res7: Long = 10
    > 
    > scala> rdd.isCheckpointed
    > res8: Boolean = false
    
    The documentation for checkpointing a bit confusing since it says 
    
    > This function must be called before any job has been executed on this RDD.
    
    yet also says
    
    > It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
    
    Perhaps we should also clarify the checkpointing javadoc to state:
    
    > It is strongly recommended that this RDD is marked for in memory persistance, otherwise saving it on a file will require recomputation.
    
    What are your thoughts?
    That being said I'm pretty sure itemFactors hasn't already been materalized at this point since we only mark it for persistence on L656 and there isn't any action before then.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207136944
  
    **[Test build #55248 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55248/consoleFull)** for PR 11919 at commit [`c1493a1`](https://github.com/apache/spark/commit/c1493a167a063128a8badd8b5ab4ff7293c57945).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r60913444
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1310,30 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up its all of the shuffles eagerly.
    +   */
    +  private[spark] def cleanDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    --- End diff --
    
    again minor, but let's call it `cleanShuffleDependencies`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216125232
  
    **[Test build #57520 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57520/consoleFull)** for PR 11919 at commit [`3a18915`](https://github.com/apache/spark/commit/3a18915ee9c803deb494eee77ba960646e2d470c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r58922363
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,7 +656,8 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    +          // itemFactors gets materialized in computeFactors & here.
    +          ALS.checkpointAndCleanParents(itemFactors)
    --- End diff --
    
    It shouldn't be too bad on a cached RDD but I can change it to take a materilization action so we don't execute count twice.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-214568396
  
    **[Test build #56929 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56929/consoleFull)** for PR 11919 at commit [`83a938e`](https://github.com/apache/spark/commit/83a938e8d4a42bf1f07a42114f1b01c1ff24b805).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216176722
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57520/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-210611413
  
    @holdenk made a few comments above. I think we can simplify to just have a `cleanParents` method which is called before `deletePreviousCheckpointFile`. Perhaps we can call it something like `cleanShuffleDependencies` to be even more clear. In this way, the existing checkpoint flagging and materialization is unchanged.
    
    It would be good to also test that this fixes the issue in [SPARK-6334](https://issues.apache.org/jira/browse/SPARK-6334) in actually running an ALS job.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61764855
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1355,4 +1359,28 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    sc.cleaner.foreach{ cleaner =>
    +      /**
    +       * Clean the shuffles & all of its parents.
    +       */
    +      def cleanEagerly(dep: Dependency[_]): Unit = {
    +        if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) {
    +          val shuffleId = dep.asInstanceOf[ShuffleDependency[_, _, _]].shuffleId
    +          cleaner.doCleanupShuffle(shuffleId, blocking)
    --- End diff --
    
    I don't know the behavior when the shuffle files are deleted but we still need to recover the RDDs that depend on them. The safest approach would be cleaning up to the cache RDDs, but it is still valuable to figure out the current behavior.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708547
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(tempDir.getAbsolutePath)
    +        // Test checkpoint and clean parents
    +        val filesBefore = getAllFiles
    --- End diff --
    
    Do we expect files generated automatically by `sc`? If so, we should leave a comment here and explain there are some initial files created by `sc.setCheckpointDir`. Otherwise, we don't need `filesBefore` because it should be empty.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r60913012
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,13 +656,15 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             previousItemFactors.unpersist()
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
    +        val deps = itemFactors.dependencies
    --- End diff --
    
    really minor, but let's move this a line down under `if (shouldCheckpoint(iter)) { `, just to match the explicit case below


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-214522903
  
    **[Test build #56915 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56915/consoleFull)** for PR 11919 at commit [`e980596`](https://github.com/apache/spark/commit/e980596cc1f87893ef90a4eb4d17937f00fad343).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61710810
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1355,4 +1359,28 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    sc.cleaner.foreach{ cleaner =>
    +      /**
    +       * Clean the shuffles & all of its parents.
    +       */
    +      def cleanEagerly(dep: Dependency[_]): Unit = {
    +        if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) {
    +          val shuffleId = dep.asInstanceOf[ShuffleDependency[_, _, _]].shuffleId
    +          cleaner.doCleanupShuffle(shuffleId, blocking)
    --- End diff --
    
    The test case has multiple iterations inside of it. Node failure recovery is handled from the checkpoint files. If the shuffle files are gone we should be able to fall back to recomputing the stage that created the shuffle files no? I can certainly change it though (and I'll run a job tomorrow to verify).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r58854519
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,7 +656,8 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    +          // itemFactors gets materialized in computeFactors & here.
    +          ALS.checkpointAndCleanParents(itemFactors)
    --- End diff --
    
    `cleanEagerly` does a `rdd.count()` - in the implicit case I believe `itemFactors` is already materialized hence in this case we would be doing 2 counts which may be wasteful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-201080302
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54100/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14099][ML] Clear shuffle files after ch...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-200503416
  
    **[Test build #53962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53962/consoleFull)** for PR 11919 at commit [`2f19399`](https://github.com/apache/spark/commit/2f1939904259ac625f839c5cbca90790ead37ef5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-215185023
  
    @mengxr @jkbradley @srowen can you take a quick pass please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-205986044
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55021/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-204499348
  
    nifty bot, since I'm hear jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61319192
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1310,30 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    if (sc.cleaner.isEmpty) {
    +      return
    +    }
    +    val cleaner = sc.cleaner.get
    +    /**
    +     * Clean the shuffles & all of its parents.
    +     */
    +    def cleanEagerly(dep: Dependency[_]): Unit = {
    --- End diff --
    
    Since its recursive internal def would be easier to read I figured.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-214375710
  
    @holdenk made a few very minor comments, otherwise this LGTM. For good measure I ran the same ALS shuffle clean up test in master (without this PR) and confirmed it failed with a set of shuffle files > 1.
    
    @mengxr (or @jkbradley @srowen) could you take a quick pass?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-212696351
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56449/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61767887
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val checkpointDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(checkpointDir.getAbsolutePath)
    +        // Test checkpoint and clean parents
    +        val input = sc.parallelize(1 to 1000)
    +        val keyed = input.map(x => (x % 20, 1))
    +        val shuffled = keyed.reduceByKey(_ + _)
    +        val keysOnly = shuffled.keys
    +        val deps = keysOnly.dependencies
    +        keysOnly.count()
    +        ALS.cleanShuffleDependencies(sc, deps, true)
    +        val resultingFiles = getAllFiles
    +        assert(resultingFiles === Set())
    +        // Ensure running count again works fine even if we kill the shuffle files.
    +        keysOnly.count()
    +      } finally {
    +        sc.stop()
    +      }
    +    } finally {
    +      Utils.deleteRecursively(localDir)
    +      Utils.deleteRecursively(checkpointDir)
    +    }
    +  }
    +
    +  test("ALS shuffle cleanup") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val checkpointDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(checkpointDir.getAbsolutePath)
    +        // Generate test data
    +        val (training, _) = ALSSuite.genImplicitTestData(sc, 100, 10, 1, 0.2, 0)
    +        // Implicitly test the cleaning of parents during ALS training
    +        val sqlContext = new SQLContext(sc)
    +        import sqlContext.implicits._
    +        val als = new ALS()
    +          .setRank(1)
    +          .setRegParam(1e-5)
    +          .setSeed(0)
    +          .setCheckpointInterval(1)
    +          .setMaxIter(50)
    --- End diff --
    
    What was wrong initially with the default max iter of 10?
    On Mon, 2 May 2016 at 18:48, Xiangrui Meng <no...@github.com> wrote:
    
    > In mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala
    > <https://github.com/apache/spark/pull/11919#discussion_r61765218>:
    >
    > > +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    > > +      conf.set("spark.shuffle.
    >
    > manager", "sort")
    > > +      val sc = new SparkContext("local[2]", "test", conf)
    > > +      try {
    > > +        sc.setCheckpointDir(checkpointDir.getAbsolutePath)
    > > +        // Generate test data
    > > +        val (training, _) = ALSSuite.genImplicitTestData(sc, 100, 10, 1, 0.2, 0)
    > > +        // Implicitly test the cleaning of parents during ALS training
    > > +        val sqlContext = new SQLContext(sc)
    > > +        import sqlContext.implicits._
    > > +        val als = new ALS()
    > > +          .setRank(1)
    > > +          .setRegParam(1e-5)
    > > +          .setSeed(0)
    > > +          .setCheckpointInterval(1)
    > > +          .setMaxIter(50)
    >
    > This test takes 20 seconds to run on my local. Please optimize the time
    > required.
    >
    > —
    > You are receiving this because you were mentioned.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/11919/files/3a18915ee9c803deb494eee77ba960646e2d470c#r61765218>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-214568577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56929/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-214523148
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56915/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14099][ML] Clear shuffle files after ch...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-200545527
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53962/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213117781
  
    **[Test build #56591 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56591/consoleFull)** for PR 11919 at commit [`b78d9f6`](https://github.com/apache/spark/commit/b78d9f6f83d69710d0789c36a1393dbec193543e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61316331
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1310,30 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    if (sc.cleaner.isEmpty) {
    +      return
    +    }
    +    val cleaner = sc.cleaner.get
    +    /**
    +     * Clean the shuffles & all of its parents.
    +     */
    +    def cleanEagerly(dep: Dependency[_]): Unit = {
    --- End diff --
    
    Does this need to be an internal def as opposed to just a closure? could also be more straightforward to read


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213138064
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56584/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216088321
  
    jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-212671924
  
    **[Test build #56449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56449/consoleFull)** for PR 11919 at commit [`8c9cc15`](https://github.com/apache/spark/commit/8c9cc1536051361922b556ff0382162f52519fee).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213117381
  
    I've added a test which does an ALS training run with a checkpoint interval of one and ensures only the most recent shuffle files remain.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61764939
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    --- End diff --
    
    I don't see any thing particular in this test that requires sort shuffle. I removed this line and the test still passed on my local.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708521
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    --- End diff --
    
    `test("ALS. cleanShuffleDependencies")`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207069254
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-215252925
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-206041431
  
    **[Test build #55058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55058/consoleFull)** for PR 11919 at commit [`ba36127`](https://github.com/apache/spark/commit/ba361273006782e53b1e1b002edb0ac27ef9b4b9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r60822985
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1306,33 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to checkpoint the RDD and clean up its all of its parents' shuffles eagerly.
    +   */
    +  private[spark] def checkpointAndCleanParents[T](rdd: RDD[T], blocking: Boolean = false): Unit = {
    +    val sc = rdd.sparkContext
    +    // If there is no reference tracking we skip clean up.
    +    if (sc.cleaner.isEmpty) {
    +      return rdd.checkpoint()
    --- End diff --
    
    Ah thats a good catch (this used to not be an issue since I left the materilization in the initial PR for both). Anyways I'll refactor this to break up the cleanup and explicitly capture the deps.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216176721
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216089159
  
    **[Test build #57493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57493/consoleFull)** for PR 11919 at commit [`95abf54`](https://github.com/apache/spark/commit/95abf54726bf201d1dc9e0511ac25ed6e12946be).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708653
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -620,4 +663,82 @@ object ALSSuite {
         "intermediateStorageLevel" -> "MEMORY_ONLY",
         "finalStorageLevel" -> "MEMORY_AND_DISK_SER"
       )
    +
    +  // Helper functions to generate test data we share between ALS test suites
    +
    +  /**
    +   * Generates random user/item factors, with i.i.d. values drawn from U(a, b).
    +   * @param size number of users/items
    +   * @param rank number of features
    +   * @param random random number generator
    +   * @param a min value of the support (default: -1)
    +   * @param b max value of the support (default: 1)
    +   * @return a sequence of (ID, factors) pairs
    +   */
    +  private def genFactors(
    +      size: Int,
    +      rank: Int,
    +      random: Random,
    +      a: Float = -1.0f,
    +      b: Float = 1.0f): Seq[(Int, Array[Float])] = {
    +    require(size > 0 && size < Int.MaxValue / 3)
    +    require(b > a)
    +    val ids = mutable.Set.empty[Int]
    +    while (ids.size < size) {
    +      ids += random.nextInt()
    +    }
    +    val width = b - a
    +    ids.toSeq.sorted.map(id => (id, Array.fill(rank)(a + random.nextFloat() * width)))
    +  }
    +
    +  /**
    +   * Generates an implicit feedback dataset for testing ALS.
    +   *
    +   * @param sc SparkContext
    +   * @param numUsers number of users
    +   * @param numItems number of items
    +   * @param rank rank
    +   * @param noiseStd the standard deviation of additive Gaussian noise on training data
    +   * @param seed random seed
    +   * @return (training, test)
    +   */
    +  def genImplicitTestData(
    +      sc: SparkContext,
    +      numUsers: Int,
    +      numItems: Int,
    +      rank: Int,
    +      noiseStd: Double = 0.0,
    +      seed: Long = 11L): (RDD[Rating[Int]], RDD[Rating[Int]]) = {
    +      // The assumption of the implicit feedback model is that unobserved ratings are more likely to
    --- End diff --
    
    fix indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-204544712
  
    **[Test build #54709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54709/consoleFull)** for PR 11919 at commit [`2f19399`](https://github.com/apache/spark/commit/2f1939904259ac625f839c5cbca90790ead37ef5).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ALSCleanerSuite extends SparkFunSuite `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14099][ML] Clear shuffle files after ch...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-200545277
  
    **[Test build #53962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53962/consoleFull)** for PR 11919 at commit [`2f19399`](https://github.com/apache/spark/commit/2f1939904259ac625f839c5cbca90790ead37ef5).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ALSCleanerSuite extends SparkFunSuite `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213156620
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207163592
  
    Streaming failures seem likely unrelated, jenkins retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-212696237
  
    **[Test build #56449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56449/consoleFull)** for PR 11919 at commit [`8c9cc15`](https://github.com/apache/spark/commit/8c9cc1536051361922b556ff0382162f52519fee).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-212917172
  
    Yeah by default the factors RDDs are persisted in memory... or if not at least memory_disk. I guess having that small additional cost and actually cleaning up the shuffle deps and not ultimately eating all the disk space and crashing the program, is the lesser of evils :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708643
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(tempDir.getAbsolutePath)
    +        // Test checkpoint and clean parents
    +        val filesBefore = getAllFiles
    +        val input = sc.parallelize(1 to 1000)
    +        val keyed = input.map(x => (x % 20, 1))
    +        val shuffled = keyed.reduceByKey(_ + _)
    +        val keysOnly = shuffled.map{case (x, _) => x}
    +        val deps = keysOnly.dependencies
    +        keysOnly.checkpoint()
    +        keysOnly.count()
    +        ALS.cleanShuffleDependencies(sc, deps, true)
    +        assert(keysOnly.isCheckpointed)
    +        val resultingFiles = getAllFiles -- filesBefore
    +        assert(resultingFiles === Set())
    +      } finally {
    +        sc.stop()
    +      }
    +    } finally {
    +      Utils.deleteRecursively(localDir)
    +      Utils.deleteRecursively(tempDir)
    +    }
    +  }
    +
    +  test("ALS shuffle cleanup") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(tempDir.getAbsolutePath)
    +        // Generate test data
    +        val (training, _) = ALSSuite.genImplicitTestData(sc, 100, 10, 1, 0.2, 0)
    +        // Implicitly test the cleaning of parents during ALS training
    +        val filesBefore = getAllFiles
    +        val sqlContext = new SQLContext(sc)
    +        import sqlContext.implicits._
    +        val als = new ALS()
    +          .setRank(1)
    --- End diff --
    
    set `maxIter` explicitly


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708507
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1355,4 +1359,28 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    sc.cleaner.foreach{ cleaner =>
    +      /**
    +       * Clean the shuffles & all of its parents.
    +       */
    +      def cleanEagerly(dep: Dependency[_]): Unit = {
    +        if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) {
    +          val shuffleId = dep.asInstanceOf[ShuffleDependency[_, _, _]].shuffleId
    +          cleaner.doCleanupShuffle(shuffleId, blocking)
    --- End diff --
    
    Does it recursively clean all ancestors, including cached RDDs like `userInBlocks`? Did you test the implementation with many iterations? If the shuffle files are deleted, how does ALS recover from node failures? Please run a job and kill some nodes during the middle and see what happens. My recommendation is cleaning shuffle files up to the cached RDDs and stop there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-209688648
  
    **[Test build #55745 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55745/consoleFull)** for PR 11919 at commit [`e7b6413`](https://github.com/apache/spark/commit/e7b6413fb6bee12d344f85b42f8133ac543e171e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by DeepSparkBot <gi...@git.apache.org>.

Github user DeepSparkBot commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-204490237
  
    @mengxr @jkbradley Please advise.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708640
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(tempDir.getAbsolutePath)
    +        // Test checkpoint and clean parents
    +        val filesBefore = getAllFiles
    +        val input = sc.parallelize(1 to 1000)
    +        val keyed = input.map(x => (x % 20, 1))
    +        val shuffled = keyed.reduceByKey(_ + _)
    +        val keysOnly = shuffled.map{case (x, _) => x}
    +        val deps = keysOnly.dependencies
    +        keysOnly.checkpoint()
    --- End diff --
    
    Is it necessary to include `checkpoint` in the test of "ALS.cleanShuffleDependencies"? Without checkpoint, we can still call "cleanShuffleDependencies".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-214568575
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-204544959
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r60926339
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -673,9 +675,11 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors = computeFactors(userFactors, userOutBlocks, itemInBlocks, rank, regParam,
               userLocalIndexEncoder, solver = solver)
             if (shouldCheckpoint(iter)) {
    +          val deps = itemFactors.dependencies
               itemFactors.checkpoint()
               itemFactors.count() // checkpoint item factors and cut lineage
               deletePreviousCheckpointFile()
    +          ALS.cleanDependencies(sc, deps)
    --- End diff --
    
    minor but let's move this up one line to be consistent with the implicit case above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r59927220
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,7 +656,8 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    +          // itemFactors gets materialized in computeFactors & here.
    +          ALS.checkpointAndCleanParents(itemFactors)
    --- End diff --
    
    In the implicit case, `itemFactors` should be materialized in `computeYtY` (uses `aggregate`). It actually happens in L663 below (answering my previous question about why the checkpoint file deletion occurs after `userFactors` is computed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-209688857
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55745/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207069258
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55233/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61319293
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1310,30 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    if (sc.cleaner.isEmpty) {
    +      return
    +    }
    +    val cleaner = sc.cleaner.get
    --- End diff --
    
    probably, I'll change it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-204544960
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54709/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708539
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    --- End diff --
    
    Is it relevant to the test?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61769915
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val checkpointDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(checkpointDir.getAbsolutePath)
    +        // Test checkpoint and clean parents
    +        val input = sc.parallelize(1 to 1000)
    +        val keyed = input.map(x => (x % 20, 1))
    +        val shuffled = keyed.reduceByKey(_ + _)
    +        val keysOnly = shuffled.keys
    +        val deps = keysOnly.dependencies
    +        keysOnly.count()
    +        ALS.cleanShuffleDependencies(sc, deps, true)
    +        val resultingFiles = getAllFiles
    +        assert(resultingFiles === Set())
    +        // Ensure running count again works fine even if we kill the shuffle files.
    +        keysOnly.count()
    +      } finally {
    +        sc.stop()
    +      }
    +    } finally {
    +      Utils.deleteRecursively(localDir)
    +      Utils.deleteRecursively(checkpointDir)
    +    }
    +  }
    +
    +  test("ALS shuffle cleanup") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val checkpointDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(checkpointDir.getAbsolutePath)
    +        // Generate test data
    +        val (training, _) = ALSSuite.genImplicitTestData(sc, 100, 10, 1, 0.2, 0)
    +        // Implicitly test the cleaning of parents during ALS training
    +        val sqlContext = new SQLContext(sc)
    +        import sqlContext.implicits._
    +        val als = new ALS()
    +          .setRank(1)
    +          .setRegParam(1e-5)
    +          .setSeed(0)
    +          .setCheckpointInterval(1)
    +          .setMaxIter(50)
    --- End diff --
    
    I've bumped it down from 50 to 8 so the test takes about ~8 seconds on my machine. @mengxr asked for an explicit `maxIter` earlier.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61316385
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1310,30 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    if (sc.cleaner.isEmpty) {
    +      return
    +    }
    +    val cleaner = sc.cleaner.get
    +    /**
    +     * Clean the shuffles & all of its parents.
    +     */
    +    def cleanEagerly(dep: Dependency[_]): Unit = {
    +      if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) {
    +        val shuffleId = dep.asInstanceOf[ShuffleDependency[_, _, _]].shuffleId
    +        cleaner.doCleanupShuffle(shuffleId, blocking)
    +      }
    +      val rdd = dep.rdd
    --- End diff --
    
    Get dep.rdd.dependencies once here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207105068
  
    **[Test build #55248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55248/consoleFull)** for PR 11919 at commit [`c1493a1`](https://github.com/apache/spark/commit/c1493a167a063128a8badd8b5ab4ff7293c57945).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708500
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1355,4 +1359,28 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    sc.cleaner.foreach{ cleaner =>
    --- End diff --
    
    space before `{`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216456513
  
    LGTM. Merged into master and branch-2.0. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61316554
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -541,4 +585,82 @@ object ALSSuite {
         "nonnegative" -> true,
         "checkpointInterval" -> 20
       )
    +
    +  // Helper functions to generate test data we share between ALS test suites
    +
    +  /**
    +   * Generates random user/item factors, with i.i.d. values drawn from U(a, b).
    +   * @param size number of users/items
    +   * @param rank number of features
    +   * @param random random number generator
    +   * @param a min value of the support (default: -1)
    +   * @param b max value of the support (default: 1)
    +   * @return a sequence of (ID, factors) pairs
    +   */
    +  private def genFactors(
    --- End diff --
    
    These functions are just a straight copy/move of the previous code right? no changes to see?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213099707
  
    **[Test build #56584 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56584/consoleFull)** for PR 11919 at commit [`58be94b`](https://github.com/apache/spark/commit/58be94b823d663af35fc2db46be5b23ba2c46cea).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61543239
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1317,23 +1317,22 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
       private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
           blocking: Boolean = false): Unit = {
         // If there is no reference tracking we skip clean up.
    -    if (sc.cleaner.isEmpty) {
    -      return
    -    }
    -    val cleaner = sc.cleaner.get
    -    /**
    -     * Clean the shuffles & all of its parents.
    -     */
    -    def cleanEagerly(dep: Dependency[_]): Unit = {
    -      if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) {
    -        val shuffleId = dep.asInstanceOf[ShuffleDependency[_, _, _]].shuffleId
    -        cleaner.doCleanupShuffle(shuffleId, blocking)
    -      }
    -      val rdd = dep.rdd
    -      if (rdd.dependencies != null) {
    -        rdd.dependencies.foreach(cleanEagerly)
    +    sc.cleaner.foreach{cleaner =>
    --- End diff --
    
    minor nit - spaces: `foreach { cleaner =>`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216332997
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-214482289
  
    **[Test build #56915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56915/consoleFull)** for PR 11919 at commit [`e980596`](https://github.com/apache/spark/commit/e980596cc1f87893ef90a4eb4d17937f00fad343).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216088511
  
    Merged with the latest from master (got out of sync) and made the minor CR changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r58662187
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,7 +656,8 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    +          // itemFactors gets materialized in computeFactors & here.
    --- End diff --
    
    Good catch, I'll fix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-214546851
  
    **[Test build #56929 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56929/consoleFull)** for PR 11919 at commit [`83a938e`](https://github.com/apache/spark/commit/83a938e8d4a42bf1f07a42114f1b01c1ff24b805).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61845452
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1355,4 +1359,28 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    sc.cleaner.foreach{ cleaner =>
    +      /**
    +       * Clean the shuffles & all of its parents.
    +       */
    +      def cleanEagerly(dep: Dependency[_]): Unit = {
    +        if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) {
    +          val shuffleId = dep.asInstanceOf[ShuffleDependency[_, _, _]].shuffleId
    +          cleaner.doCleanupShuffle(shuffleId, blocking)
    --- End diff --
    
    Thanks! The second `count()` demonstrates the behavior clearly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61764867
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    --- End diff --
    
    This is not about the style. It is not clear about what it does from the name, especially when the two tests are named "Clean shuffles" and "ALS shuffle cleanup" and both are under "ALSCleanerSuite". You don't need to use the name suggested by me. Just make sure the name concisely describe the test itself.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-214523146
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207196167
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55295/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-215252927
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57171/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-206067657
  
    **[Test build #55058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55058/consoleFull)** for PR 11919 at commit [`ba36127`](https://github.com/apache/spark/commit/ba361273006782e53b1e1b002edb0ac27ef9b4b9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `    .doc(\"The output committer class used by Parquet. The specified class needs to be a \" +`
      * `    .doc(\"A comma separated list of class prefixes that should be loaded using the classloader \" +`
      * `    .doc(\"A comma separated list of class prefixes that should explicitly be reloaded for each \" +`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-209657823
  
    **[Test build #55745 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55745/consoleFull)** for PR 11919 at commit [`e7b6413`](https://github.com/apache/spark/commit/e7b6413fb6bee12d344f85b42f8133ac543e171e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213676097
  
    **[Test build #56779 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56779/consoleFull)** for PR 11919 at commit [`dd50130`](https://github.com/apache/spark/commit/dd5013002611d3c232b8384eef89f13f9113eef4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216097487
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r60968178
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,13 +656,15 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             previousItemFactors.unpersist()
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
    +        val deps = itemFactors.dependencies
    --- End diff --
    
    So it needs to be outside of the conditional because if I put it inside of the conditional it won't be in scope. I could make it an Option and capture the deps there but I think that would be a little convoluted.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-201080297
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-206182667
  
    **[Test build #55103 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55103/consoleFull)** for PR 11919 at commit [`84474e1`](https://github.com/apache/spark/commit/84474e18d7f921d22a393d42731e8aac1982be42).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207888805
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55447/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61318965
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -541,4 +585,82 @@ object ALSSuite {
         "nonnegative" -> true,
         "checkpointInterval" -> 20
       )
    +
    +  // Helper functions to generate test data we share between ALS test suites
    +
    +  /**
    +   * Generates random user/item factors, with i.i.d. values drawn from U(a, b).
    +   * @param size number of users/items
    +   * @param rank number of features
    +   * @param random random number generator
    +   * @param a min value of the support (default: -1)
    +   * @param b max value of the support (default: 1)
    +   * @return a sequence of (ID, factors) pairs
    +   */
    +  private def genFactors(
    --- End diff --
    
    only change is this genImplicitTestData takes a SparkContext and the other used the one from the class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708525
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    --- End diff --
    
    `checkpointDir`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-212672336
  
    @MLnick As mentioned in the line comments, that approach turns out to not be as simple as planned. checkpointing kills all of the parents information we need to clean up the shuffle files. I could refactor this so that we capture the dependency information needed - but a count() on a cached RDD should be low enough cost I'm not sure it would be worth it. What are your thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216076304
  
    **[Test build #57484 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57484/consoleFull)** for PR 11919 at commit [`95abf54`](https://github.com/apache/spark/commit/95abf54726bf201d1dc9e0511ac25ed6e12946be).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708499
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1355,4 +1359,28 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    --- End diff --
    
    Please follow Spark Code Style guide and chop down the args.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213664760
  
    **[Test build #56779 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56779/consoleFull)** for PR 11919 at commit [`dd50130`](https://github.com/apache/spark/commit/dd5013002611d3c232b8384eef89f13f9113eef4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61769074
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1355,4 +1359,28 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    sc.cleaner.foreach{ cleaner =>
    +      /**
    +       * Clean the shuffles & all of its parents.
    +       */
    +      def cleanEagerly(dep: Dependency[_]): Unit = {
    +        if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) {
    +          val shuffleId = dep.asInstanceOf[ShuffleDependency[_, _, _]].shuffleId
    +          cleaner.doCleanupShuffle(shuffleId, blocking)
    --- End diff --
    
    So my understanding is that previously the ttl cleaner could clean up check point files and Spark would still need to handle recompute. Of course I've gone ahead and gone with the safer choice - although I also modified the first test to illustrate that recompute works fine (we clean up the shuffle files of a non-cached RDD and then call count).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-201044084
  
    jenkins retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r59932613
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,7 +656,8 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    +          // itemFactors gets materialized in computeFactors & here.
    +          ALS.checkpointAndCleanParents(itemFactors)
    --- End diff --
    
    That makes sense, I'll break it up to just clean the parents.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216076340
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57484/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213156622
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56591/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r58854819
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -672,7 +673,7 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors = computeFactors(userFactors, userOutBlocks, itemInBlocks, rank, regParam,
               userLocalIndexEncoder, solver = solver)
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint()
    +          ALS.checkpointAndCleanParents(itemFactors)
               itemFactors.count() // checkpoint item factors and cut lineage
               deletePreviousCheckpointFile()
    --- End diff --
    
    @mengxr I haven't looked at this code in quite a while - can you remind me why in the explicit case `deletePreviousCheckpointFile` and `previousCheckpointFile = itemFactors.getCheckpointFile` happens before `userFactors` is computed, while in the implicit case it happens after?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61708553
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val tempDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(tempDir.getAbsolutePath)
    +        // Test checkpoint and clean parents
    +        val filesBefore = getAllFiles
    +        val input = sc.parallelize(1 to 1000)
    +        val keyed = input.map(x => (x % 20, 1))
    +        val shuffled = keyed.reduceByKey(_ + _)
    +        val keysOnly = shuffled.map{case (x, _) => x}
    --- End diff --
    
    * space around `{` and before `}`
    * This is `shuffled.keys`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-210619387
  
    Sounds reasonable, I'll also do some manual testing after the refactor (and or I could add some automatic tests using the entire ALS pipeline instead of just the cleaner component).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207137101
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55248/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207044799
  
    **[Test build #55233 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55233/consoleFull)** for PR 11919 at commit [`c1493a1`](https://github.com/apache/spark/commit/c1493a167a063128a8badd8b5ab4ff7293c57945).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r60725882
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1306,33 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to checkpoint the RDD and clean up its all of its parents' shuffles eagerly.
    +   */
    +  private[spark] def checkpointAndCleanParents[T](rdd: RDD[T], blocking: Boolean = false): Unit = {
    +    val sc = rdd.sparkContext
    +    // If there is no reference tracking we skip clean up.
    +    if (sc.cleaner.isEmpty) {
    +      return rdd.checkpoint()
    --- End diff --
    
    If we bail out early here, then we don't do a `count`. So in the explicit case, I believe the checkpoint won't work correctly at [L677](https://github.com/apache/spark/pull/11919/files#diff-be65dd1d6adc53138156641b610fcadaR677).
    
    Perhaps we can refactor like:
    ```scala
    val cleaner = rdd.sparkContext.cleaner
    val deps = rdd.dependencies
    rdd.checkpoint()
    rdd.count()
    cleaner.foreach { c =>
      deps.foreach(_ => cleanEagerly(_, c))
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213676197
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-206067768
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55058/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207196165
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-201079951
  
    **[Test build #54100 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54100/consoleFull)** for PR 11919 at commit [`2f19399`](https://github.com/apache/spark/commit/2f1939904259ac625f839c5cbca90790ead37ef5).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ALSCleanerSuite extends SparkFunSuite `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207102843
  
    Jenkins retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207862384
  
    **[Test build #55447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55447/consoleFull)** for PR 11919 at commit [`324efc7`](https://github.com/apache/spark/commit/324efc72fb37d23783b52f8c3eca686c67515f36).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-206242689
  
    **[Test build #55103 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55103/consoleFull)** for PR 11919 at commit [`84474e1`](https://github.com/apache/spark/commit/84474e18d7f921d22a393d42731e8aac1982be42).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216097418
  
    **[Test build #57493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57493/consoleFull)** for PR 11919 at commit [`95abf54`](https://github.com/apache/spark/commit/95abf54726bf201d1dc9e0511ac25ed6e12946be).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216301086
  
    **[Test build #57538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57538/consoleFull)** for PR 11919 at commit [`0ed7deb`](https://github.com/apache/spark/commit/0ed7debdc6e5504b4f81ce62a599c585a9990f37).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216332792
  
    **[Test build #57538 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57538/consoleFull)** for PR 11919 at commit [`0ed7deb`](https://github.com/apache/spark/commit/0ed7debdc6e5504b4f81ce62a599c585a9990f37).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-209688854
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61543683
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1310,29 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    sc.cleaner.foreach{cleaner =>
    +      /**
    +       * Clean the shuffles & all of its parents.
    +       */
    +      def cleanEagerly(dep: Dependency[_]): Unit = {
    +        if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) {
    +          val shuffleId = dep.asInstanceOf[ShuffleDependency[_, _, _]].shuffleId
    +          cleaner.doCleanupShuffle(shuffleId, blocking)
    +        }
    +        val rdd = dep.rdd
    +        val rddDeps = rdd.dependencies
    --- End diff --
    
    again minor, but I think as per Sean's comment we can make this one line `val rddDeps = dep.rdd.dependencies`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-215221717
  
    **[Test build #57171 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57171/consoleFull)** for PR 11919 at commit [`868977c`](https://github.com/apache/spark/commit/868977cdae3279a87ede1c3b3c3c3d45246f419d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61543360
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1310,29 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    sc.cleaner.foreach{cleaner =>
    --- End diff --
    
    minor nit - should be spaces around `{`, i.e. `foreach { cleaner =>`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14099][ML] Clear shuffle files after ch...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-200545522
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-206243499
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55103/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-205986042
  
    Build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r58660406
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,7 +656,8 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    +          // itemFactors gets materialized in computeFactors & here.
    --- End diff --
    
    Is there a reason we're not also using `checkpointAndCleanParents` in the explicit prefs code path below?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213156195
  
    **[Test build #56591 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56591/consoleFull)** for PR 11919 at commit [`b78d9f6`](https://github.com/apache/spark/commit/b78d9f6f83d69710d0789c36a1393dbec193543e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213138057
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-213676200
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56779/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-206243490
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207888247
  
    **[Test build #55447 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55447/consoleFull)** for PR 11919 at commit [`324efc7`](https://github.com/apache/spark/commit/324efc72fb37d23783b52f8c3eca686c67515f36).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61323587
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -1306,4 +1310,30 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
        * satisfies this requirement, we simply use a type alias here.
        */
       private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner
    +
    +  /**
    +   * Private function to clean up all of the shuffles files from the dependencies and their parents.
    +   */
    +  private[spark] def cleanShuffleDependencies[T](sc: SparkContext, deps: Seq[Dependency[_]],
    +      blocking: Boolean = false): Unit = {
    +    // If there is no reference tracking we skip clean up.
    +    if (sc.cleaner.isEmpty) {
    +      return
    +    }
    +    val cleaner = sc.cleaner.get
    +    /**
    +     * Clean the shuffles & all of its parents.
    +     */
    +    def cleanEagerly(dep: Dependency[_]): Unit = {
    +      if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) {
    +        val shuffleId = dep.asInstanceOf[ShuffleDependency[_, _, _]].shuffleId
    +        cleaner.doCleanupShuffle(shuffleId, blocking)
    --- End diff --
    
    This is the core of it and it sounds like it's what Xiangrui intended. I can't say I'm an expert on the cleaner though


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-201045232
  
    **[Test build #54100 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54100/consoleFull)** for PR 11919 at commit [`2f19399`](https://github.com/apache/spark/commit/2f1939904259ac625f839c5cbca90790ead37ef5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-215652391
  
    @holdenk a couple minor nit-picking comments :) I'm happy with this. @mengxr I will leave open for a day or two if you have time for a quick pass.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216332998
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57538/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216076339
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216070750
  
    **[Test build #57484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57484/consoleFull)** for PR 11919 at commit [`95abf54`](https://github.com/apache/spark/commit/95abf54726bf201d1dc9e0511ac25ed6e12946be).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207068984
  
    **[Test build #55233 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55233/consoleFull)** for PR 11919 at commit [`c1493a1`](https://github.com/apache/spark/commit/c1493a167a063128a8badd8b5ab4ff7293c57945).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-207195938
  
    **[Test build #55295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55295/consoleFull)** for PR 11919 at commit [`c1493a1`](https://github.com/apache/spark/commit/c1493a167a063128a8badd8b5ab4ff7293c57945).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r58854532
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -672,7 +673,7 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors = computeFactors(userFactors, userOutBlocks, itemInBlocks, rank, regParam,
               userLocalIndexEncoder, solver = solver)
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint()
    +          ALS.checkpointAndCleanParents(itemFactors)
    --- End diff --
    
    `cleanEagerly` does a `rdd.count()` and we do another one just below.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-215252746
  
    **[Test build #57171 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57171/consoleFull)** for PR 11919 at commit [`868977c`](https://github.com/apache/spark/commit/868977cdae3279a87ede1c3b3c3c3d45246f419d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r59928566
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,7 +656,8 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    +          // itemFactors gets materialized in computeFactors & here.
    +          ALS.checkpointAndCleanParents(itemFactors)
    --- End diff --
    
    We shouldn't clean the shuffle dependencies until the RDD has actually been materialized, yes? In which case, perhaps we can just have the method `checkpointAndCleanParents` be `cleanParents`, and keep the existing `checkpoint` and `count` calls.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61709638
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    --- End diff --
    
    This doesn't match the style of the rest of the tests in this file, want me to update all of them?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11919#issuecomment-216176313
  
    **[Test build #57520 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57520/consoleFull)** for PR 11919 at commit [`3a18915`](https://github.com/apache/spark/commit/3a18915ee9c803deb494eee77ba960646e2d470c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6717][ML] Clear shuffle files after che...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r61765218
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -520,6 +488,81 @@ class ALSSuite
       }
     }
     
    +class ALSCleanerSuite extends SparkFunSuite {
    +  test("Clean shuffles") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val checkpointDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(checkpointDir.getAbsolutePath)
    +        // Test checkpoint and clean parents
    +        val input = sc.parallelize(1 to 1000)
    +        val keyed = input.map(x => (x % 20, 1))
    +        val shuffled = keyed.reduceByKey(_ + _)
    +        val keysOnly = shuffled.keys
    +        val deps = keysOnly.dependencies
    +        keysOnly.count()
    +        ALS.cleanShuffleDependencies(sc, deps, true)
    +        val resultingFiles = getAllFiles
    +        assert(resultingFiles === Set())
    +        // Ensure running count again works fine even if we kill the shuffle files.
    +        keysOnly.count()
    +      } finally {
    +        sc.stop()
    +      }
    +    } finally {
    +      Utils.deleteRecursively(localDir)
    +      Utils.deleteRecursively(checkpointDir)
    +    }
    +  }
    +
    +  test("ALS shuffle cleanup") {
    +    val conf = new SparkConf()
    +    val localDir = Utils.createTempDir()
    +    val checkpointDir = Utils.createTempDir()
    +    def getAllFiles: Set[File] =
    +      FileUtils.listFiles(localDir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE).asScala.toSet
    +    try {
    +      conf.set("spark.local.dir", localDir.getAbsolutePath)
    +      conf.set("spark.shuffle.manager", "sort")
    +      val sc = new SparkContext("local[2]", "test", conf)
    +      try {
    +        sc.setCheckpointDir(checkpointDir.getAbsolutePath)
    +        // Generate test data
    +        val (training, _) = ALSSuite.genImplicitTestData(sc, 100, 10, 1, 0.2, 0)
    +        // Implicitly test the cleaning of parents during ALS training
    +        val sqlContext = new SQLContext(sc)
    +        import sqlContext.implicits._
    +        val als = new ALS()
    +          .setRank(1)
    +          .setRegParam(1e-5)
    +          .setSeed(0)
    +          .setCheckpointInterval(1)
    +          .setMaxIter(50)
    --- End diff --
    
    This test takes 20 seconds to run on my local. Please optimize the time required.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org