You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by andrewor14 <gi...@git.apache.org> on 2014/08/29 00:57:52 UTC

[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/2187

    [SPARK-3277] Fix external spilling with LZ4 assertion error

    The bulk of this PR is comprised of tests and documentation; the actual fix is really just adding 1 line of code (see `BlockObjectWriter.scala`). We currently do not run the `External*` test suites with different compression codecs, and this would have caught the bug reported in SPARK-3277. This PR extends the existing code to test spilling using all compression codecs known to Spark, including `LZ4`.
    
    **The actual bug**
    
    In `DiskBlockObjectWriter`, we only report the shuffle bytes written before we close the streams. With `LZ4`, all the bytes written in our metrics were 0 because `flush()` was not taking effect for some reason. In general, compression codecs may write additional bytes to the file after we call `close()`, and so we must also capture those bytes in our shuffle write metrics.
    
    Thanks @mridulm and @pwendell for help with debugging.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark fix-lz4-spilling

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2187.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2187
    
----
commit 1bfa7438f8cb5ef00842e67f343dd02eb9679c8a
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-28T20:15:32Z

    Add more information to assert for better debugging

commit b264a84bcd848609f26ba80f23391c439369180e
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-28T20:59:07Z

    ExternalAppendOnlyMapSuite code style fixes (minor)

commit 4bbcf68cad63d6f787a445e2eb724fcd6e6acb7c
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-28T21:38:25Z

    Update tests to actually test all compression codecs

commit 089593f6d388980694f031641091b58e1d8dcfc7
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-28T21:57:59Z

    Actually fix SPARK-3277 (tests still fail)

commit a1ad53620d209837cc456e5789c7b73d7e1b8b80
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-28T22:42:11Z

    Fix tests
    
    We need to stop the SparkContexts before creating a new one.
    Otherwise the tests get into bad states.

commit 92e251bae0f354cfe8350497d2ca0bb2bdd8028b
Author: Patrick Wendell <pw...@gmail.com>
Date:   2014-08-28T20:54:01Z

    Better documentation for BlockObjectWriter.

commit 6b2e7d155457043b967e03743c0d22556d818a3e
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-28T22:45:26Z

    Fix compilation error

commit 1c4624ed0d351375d4ff3bcb6384a27fe2b98fd5
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-28T22:45:48Z

    Merge branch 'master' of github.com:apache/spark into fix-lz4-spilling

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53912516
  
    Maybe I'm missing something, but I don't see a way for the writer to be reverted after `close()` is called. The only place to revert the writes is through `revertPartialWritesAndClose()`, but this should not be called after `close()` anyway. Perhaps we should add some safeguard against calling anything else in this class after we call `close()`, but as far as I'm concerned the existing code is correct since we don't use `reportedPosition` again after `close()`, so there is no need to update it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53826175
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19437/consoleFull) for   PR 2187 at commit [`1b54bdc`](https://github.com/apache/spark/commit/1b54bdc5d66568129fd9d47eeb5811df12957377).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2187#discussion_r16874969
  
    --- Diff: core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala ---
    @@ -200,77 +204,127 @@ class ExternalAppendOnlyMapSuite extends FunSuite with LocalSparkContext {
     
         result.foreach { case (i, (seq1, seq2)) =>
           i match {
    -        case 0 => assert(seq1.toSet == Set[Int]() && seq2.toSet == Set[Int](2, 4))
    -        case 1 => assert(seq1.toSet == Set[Int](1) && seq2.toSet == Set[Int](1, 3))
    -        case 2 => assert(seq1.toSet == Set[Int](2) && seq2.toSet == Set[Int]())
    -        case 3 => assert(seq1.toSet == Set[Int](3) && seq2.toSet == Set[Int]())
    -        case 4 => assert(seq1.toSet == Set[Int](4) && seq2.toSet == Set[Int]())
    +        case 0 => assert(seq1.toSet === Set[Int]() && seq2.toSet === Set[Int](2, 4))
    +        case 1 => assert(seq1.toSet === Set[Int](1) && seq2.toSet === Set[Int](1, 3))
    +        case 2 => assert(seq1.toSet === Set[Int](2) && seq2.toSet === Set[Int]())
    +        case 3 => assert(seq1.toSet === Set[Int](3) && seq2.toSet === Set[Int]())
    +        case 4 => assert(seq1.toSet === Set[Int](4) && seq2.toSet === Set[Int]())
           }
         }
    +    sc.stop()
       }
     
    +  /**
    +   * For tests that involve spilling, run them multiple times with different compression settings.
    +   */
    +
       test("spilling") {
    -    val conf = createSparkConf(true)  // Load defaults, otherwise SPARK_HOME is not found
    +    runSpillingTest(testSpilling)
    +  }
    +
    +  test("spilling with hash collisions") {
    +    runSpillingTest(testSpillingWithCollisions)
    +  }
    +
    +  test("spilling with many hash collisions") {
    +    runSpillingTest(testSpillingWithManyCollisions)
    +  }
    +
    +  test("spilling with hash collisions using the Int.MaxValue key") {
    +    runSpillingTest(testSpillingWithCollisionsMaxInt)
    +  }
    +
    +  test("spilling with null keys and values") {
    +    runSpillingTest(testSpillingWithNullKeysAndValues)
    +  }
    +
    +  /* ------------------------------------ *
    +   * Actual test logic for spilling tests *
    +   * ------------------------------------ */
    +
    +  /**
    +   * Run a spilling test multiple times, with and without compression and using all codecs.
    +   */
    +  private def runSpillingTest(test: Option[String] => Unit): Unit = {
    --- End diff --
    
    A cool trick I used in YarnSparkHadoopUtilSuite is to do something like this:
    
        def spillingTest(name: String)(testFn: => Unit) = test(name) {
            blah blah blah testFn() blah
        }
    
    Then you can register like this:
    
        spillingTest("spilling with null keys and values") {
           // body of testSpillingWithNullKeysAndValues
        }
    
    Would look a little bit cleaner overall, but no biggie.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53902159
  
    After a second look, close seems to be idempotent (albeit not thread-safe, but that's ok?). So it seems that the only problem is not updating `reportedPosition` in commitAndClose()?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53897378
  
    Hi @mridulm 
    
    >  On commitAndClose, reportedPosition is not updated.
    >  So a subsequent close or revert won't update bytesWritten properly.
    
    Hmmm, maybe it's my unfamiliarity with the innards here, but why would you call close or revert after you've already committed and closed the writer?
    
    (Perhaps close() could be made idempotent, but that would just protect against buggy code?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53818587
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19436/consoleFull) for   PR 2187 at commit [`1c4624e`](https://github.com/apache/spark/commit/1c4624ed0d351375d4ff3bcb6384a27fe2b98fd5).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53822823
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19437/consoleFull) for   PR 2187 at commit [`1b54bdc`](https://github.com/apache/spark/commit/1b54bdc5d66568129fd9d47eeb5811df12957377).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53929877
  
    Writers are not thread safe, so that is fine.
    Updating reportedPosition might be sufficient; a naive analysis did suggest
    that was minimal change required to this pr.
    
    But this is slightly involved code reused in various forms; with extremely
    minimal testcases ... Hence why we need to be very careful making changes
    to it (similar to the mess that is Connection/ConnectionManager).
    On 29-Aug-2014 10:25 pm, "Marcelo Vanzin" <no...@github.com> wrote:
    
    > After a second look, close seems to be idempotent (albeit not thread-safe,
    > but that's ok?). So it seems that the only problem is not updating
    > reportedPosition in commitAndClose()?
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/2187#issuecomment-53902159>.
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53900710
  
    A closed writer can be reverted in case some other writer's close fails in the shuffle group and we have to revert the entire group.
    Particularly relevant when we have consolidated shuffle.
    
    Yes, close should be idempotent - which is not the case here : each close will add to metrics


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53822687
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19436/consoleFull) for   PR 2187 at commit [`1c4624e`](https://github.com/apache/spark/commit/1c4624ed0d351375d4ff3bcb6384a27fe2b98fd5).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53831182
  
    This change looks incorrect.
    On commitAndClose, reportedPosition is not updated.
    So a subsequent close or revert won't update bytesWritten properly.
    
    Since this was a blocker, would have been better to not rush into committing it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53821142
  
    LGTM, kinda skimmed through the tests since it looked like most changes were changing `==` for `===` and other minor things like that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/2187#issuecomment-53929367
  
    I don't have my workspace handy, but search for revert usages ... That
    should answer where it is called.
    Since I added the error handling, I am fairly certain you will have revert
    after close in case of issues : not doing this actually breaks consolidated
    shuffle (1.0 behavior)
    On 29-Aug-2014 11:52 pm, "andrewor14" <no...@github.com> wrote:
    
    > Maybe I'm missing something, but I don't see a way for the writer to be
    > reverted after close() is called. The only place to revert the writes is
    > through revertPartialWritesAndClose(), but this should not be called
    > after close() anyway. Perhaps we should add some safeguard against
    > calling anything else in this class after we call close(), but as far as
    > I'm concerned the existing code is correct since we don't use
    > reportedPosition again after close(), so there is no need to update it.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/2187#issuecomment-53912516>.
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3277] Fix external spilling with LZ4 as...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2187


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org