Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2016/03/03 20:25:02 UTC

[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/11498

    [SPARK-13653] Split disk writer into separate object and binary writers

    This patch splits `DiskBlockObjectWriter` into two separate classes so that the object-based and byte-based write methods are cleanly separated. The old design, which put all of these methods behind a single interface, caused awkward problems for callers that only wanted the writer's `OutputStream` methods, since those call sites still had to pass a `DummySerializerInstance` into the writer's constructor.
    
    This patch introduces a `DiskBlockWriter`, which exposes only the `OutputStream` write methods, and modifies `DiskBlockObjectWriter` to wrap a `DiskBlockWriter` and expose only object-based write methods.
    
    I also moved the disk-writing-related classes into a new `o.a.s.storage.disk` package, since the `o.a.s.storage` package had too many classes in it.
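
    As a rough, hedged sketch of the intended split (the class names come from this patch, but the constructor signatures, the `FileOutputStream`-based `open()`, and the omission of metrics and commit/revert bookkeeping are simplifications for illustration, not the exact code):

        package org.apache.spark.storage.disk

        import java.io.{BufferedOutputStream, File, FileOutputStream, OutputStream}

        import org.apache.spark.serializer.{SerializationStream, SerializerInstance}

        // Byte-oriented writer: exposes only OutputStream-style writes, no serializer involved.
        private[spark] class DiskBlockWriter(val file: File) extends OutputStream {
          private var out: OutputStream = _

          def open(): OutputStream = {
            out = new BufferedOutputStream(new FileOutputStream(file, true))
            out
          }

          override def write(b: Int): Unit = out.write(b)
          override def write(b: Array[Byte], off: Int, len: Int): Unit = out.write(b, off, len)

          def commitAndClose(): Unit = {
            out.flush()
            out.close()
          }
        }

        // Object-oriented writer: wraps a DiskBlockWriter and layers a serializer on top,
        // so callers writing JVM objects never need a DummySerializerInstance.
        private[spark] class DiskBlockObjectWriter(
            diskBlockWriter: DiskBlockWriter,
            serializerInstance: SerializerInstance) {
          private var objOut: SerializationStream = _

          def open(): this.type = {
            objOut = serializerInstance.serializeStream(diskBlockWriter.open())
            this
          }

          def write(key: Any, value: Any): Unit = {
            objOut.writeKey(key)
            objOut.writeValue(value)
          }

          def commitAndClose(): Unit = {
            objOut.flush()
            diskBlockWriter.commitAndClose()
          }
        }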

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark disk-writer-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11498.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11498
    




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195591564
  
    **[Test build #52944 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52944/consoleFull)** for PR 11498 at commit [`8b8f4cb`](https://github.com/apache/spark/commit/8b8f4cb9771f93de70f3c56a3dff61bae32857fa).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195593973
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52945/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195593967
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11498#discussion_r57048124
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/disk/DiskBlockObjectWriter.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.storage.disk
    +
    +import java.io.{File, OutputStream}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.serializer.{SerializationStream, SerializerInstance}
    +import org.apache.spark.storage.BlockId
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A class for writing JVM objects directly to a file on disk. This class allows data to be appended
    + * to an existing block and can guarantee atomicity in the case of faults as it allows the caller to
    + * revert partial writes.
    + *
    + * This class does not support concurrent writes. Also, once the writer has been opened it cannot be
    + * reopened again.
    + */
    +private[spark] class DiskBlockObjectWriter(
    +    diskBlockWriter: DiskBlockWriter,
    +    serializerInstance: SerializerInstance,
    +    val blockId: BlockId = null)
    +  extends Logging {
    +
    +  private var bs: OutputStream = null
    +  private var objOut: SerializationStream = null
    +  private var initialized = false
    +  private var hasBeenClosed = false
    +  private var commitAndCloseHasBeenCalled = false
    +
    +  def file: File = diskBlockWriter.file
    +
    +  def open(): DiskBlockObjectWriter = {
    +    if (hasBeenClosed) {
    +      throw new IllegalStateException("Writer already closed. Cannot be reopened.")
    +    }
    +    bs = diskBlockWriter.open()
    +    objOut = serializerInstance.serializeStream(bs)
    +    initialized = true
    +    this
    +  }
    +
    +  def close() {
    +    if (initialized) {
    +      Utils.tryWithSafeFinally {
    +        if (diskBlockWriter.syncWrites) {
    +          objOut.flush()
    +        }
    +      } {
    +        objOut.close()
    +      }
    +      bs = null
    +      objOut = null
    +      initialized = false
    +      hasBeenClosed = true
    +    }
    +  }
    +
    +  /**
    +   * Flush the partial writes and commit them as a single atomic block.
    +   */
    +  def commitAndClose(): Unit = {
    +    if (initialized) {
    +      objOut.flush()
    +    }
    +    diskBlockWriter.commitAndClose()
    +    commitAndCloseHasBeenCalled = true
    +  }
    +
    +
    +  /**
    +   * Reverts writes that haven't been flushed yet. Callers should invoke this function
    +   * when there are runtime exceptions. This method will not throw, though it may be
    +   * unsuccessful in truncating written data.
    +   *
    +   * @return the file that this DiskBlockObjectWriter wrote to.
    +   */
    +  def revertPartialWritesAndClose(): File = {
    +    if (initialized) {
    +      objOut.flush()
    --- End diff --
    
    Should we catch the exception here? (same as before)
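
    For reference, a hedged sketch of what catching it here could look like (the try/catch placement mirrors the pre-split writer; the exact log message is illustrative):

        def revertPartialWritesAndClose(): File = {
          try {
            if (initialized) {
              objOut.flush()
            }
          } catch {
            case e: Exception =>
              // Never rethrow: callers invoke this on error paths and rely on it not throwing.
              logError("Uncaught exception while reverting partial writes to file " + file, e)
          }
          diskBlockWriter.revertPartialWritesAndClose()
        }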




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11498#discussion_r57097023
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/disk/DiskBlockObjectWriter.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.storage.disk
    +
    +import java.io.{File, OutputStream}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.serializer.{SerializationStream, SerializerInstance}
    +import org.apache.spark.storage.BlockId
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A class for writing JVM objects directly to a file on disk. This class allows data to be appended
    + * to an existing block and can guarantee atomicity in the case of faults as it allows the caller to
    + * revert partial writes.
    + *
    + * This class does not support concurrent writes. Also, once the writer has been opened it cannot be
    + * reopened again.
    + */
    +private[spark] class DiskBlockObjectWriter(
    +    diskBlockWriter: DiskBlockWriter,
    +    serializerInstance: SerializerInstance,
    +    val blockId: BlockId = null)
    +  extends Logging {
    +
    +  private var bs: OutputStream = null
    --- End diff --
    
    Agreed; good catch.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11498#discussion_r57047616
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/disk/DiskBlockObjectWriter.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.storage.disk
    +
    +import java.io.{File, OutputStream}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.serializer.{SerializationStream, SerializerInstance}
    +import org.apache.spark.storage.BlockId
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A class for writing JVM objects directly to a file on disk. This class allows data to be appended
    + * to an existing block and can guarantee atomicity in the case of faults as it allows the caller to
    + * revert partial writes.
    + *
    + * This class does not support concurrent writes. Also, once the writer has been opened it cannot be
    + * reopened again.
    + */
    +private[spark] class DiskBlockObjectWriter(
    +    diskBlockWriter: DiskBlockWriter,
    +    serializerInstance: SerializerInstance,
    +    val blockId: BlockId = null)
    +  extends Logging {
    +
    +  private var bs: OutputStream = null
    --- End diff --
    
    It may be easier to understand if we use `diskBlockWriter` directly instead of this `bs`.
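
    If `DiskBlockWriter` itself extends `OutputStream` (as the later revision of this patch does), `open()` could drop the `bs` field entirely; a sketch:

        def open(): DiskBlockObjectWriter = {
          if (hasBeenClosed) {
            throw new IllegalStateException("Writer already closed. Cannot be reopened.")
          }
          diskBlockWriter.open()
          // Serialize straight into the wrapped writer; no separate OutputStream field needed.
          objOut = serializerInstance.serializeStream(diskBlockWriter)
          initialized = true
          this
        }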




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11498#discussion_r57047243
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/disk/DiskBlockObjectWriter.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.storage.disk
    +
    +import java.io.{File, OutputStream}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.serializer.{SerializationStream, SerializerInstance}
    +import org.apache.spark.storage.BlockId
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A class for writing JVM objects directly to a file on disk. This class allows data to be appended
    + * to an existing block and can guarantee atomicity in the case of faults as it allows the caller to
    + * revert partial writes.
    + *
    + * This class does not support concurrent writes. Also, once the writer has been opened it cannot be
    + * reopened again.
    + */
    +private[spark] class DiskBlockObjectWriter(
    +    diskBlockWriter: DiskBlockWriter,
    +    serializerInstance: SerializerInstance,
    +    val blockId: BlockId = null)
    +  extends Logging {
    +
    +  private var bs: OutputStream = null
    +  private var objOut: SerializationStream = null
    +  private var initialized = false
    +  private var hasBeenClosed = false
    +  private var commitAndCloseHasBeenCalled = false
    +
    +  def file: File = diskBlockWriter.file
    +
    +  def open(): DiskBlockObjectWriter = {
    +    if (hasBeenClosed) {
    +      throw new IllegalStateException("Writer already closed. Cannot be reopened.")
    +    }
    +    bs = diskBlockWriter.open()
    +    objOut = serializerInstance.serializeStream(bs)
    +    initialized = true
    +    this
    +  }
    +
    +  def close() {
    +    if (initialized) {
    +      Utils.tryWithSafeFinally {
    +        if (diskBlockWriter.syncWrites) {
    +          objOut.flush()
    +        }
    +      } {
    +        objOut.close()
    +      }
    +      bs = null
    +      objOut = null
    +      initialized = false
    +      hasBeenClosed = true
    --- End diff --
    
    Existing issue: should these assignments be in the finally clause?
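
    A sketch of what that would look like (roughly what the updated revision of this patch ends up doing):

        def close() {
          Utils.tryWithSafeFinally {
            if (initialized && diskBlockWriter.syncWrites) {
              objOut.flush()
            }
          } {
            // Run the cleanup even if the flush above throws.
            if (initialized) {
              objOut.close()
            }
            bs = null
            objOut = null
            initialized = false
            hasBeenClosed = true
          }
        }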




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-191925670
  
    **[Test build #52405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52405/consoleFull)** for PR 11498 at commit [`a33893d`](https://github.com/apache/spark/commit/a33893d9cf4b9996ce49d42cdf671a988e6e61ba).




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-200112739
  
    **[Test build #53867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53867/consoleFull)** for PR 11498 at commit [`0fee604`](https://github.com/apache/spark/commit/0fee6042a7cff69747f454056fd5cec6640dc381).




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-200152103
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53867/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-191971864
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11498#discussion_r57098043
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/disk/DiskBlockObjectWriter.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.storage.disk
    +
    +import java.io.File
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.serializer.{SerializationStream, SerializerInstance}
    +import org.apache.spark.storage.BlockId
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A class for writing JVM objects directly to a file on disk. This class allows data to be appended
    + * to an existing block and can guarantee atomicity in the case of faults as it allows the caller to
    + * revert partial writes.
    + *
    + * This class does not support concurrent writes. Also, once the writer has been opened it cannot be
    + * reopened again.
    + */
    +private[spark] class DiskBlockObjectWriter(
    +    private[this] var diskBlockWriter: DiskBlockWriter,
    +    serializerInstance: SerializerInstance,
    +    val blockId: BlockId = null)
    +  extends Logging {
    +
    +  private var objOut: SerializationStream = null
    +  private var initialized = false
    +  private var hasBeenClosed = false
    +  private var commitAndCloseHasBeenCalled = false
    +
    +  def file: File = diskBlockWriter.file
    +
    +  def open(): DiskBlockObjectWriter = {
    +    if (hasBeenClosed) {
    +      throw new IllegalStateException("Writer already closed. Cannot be reopened.")
    +    }
    +    diskBlockWriter.open()
    +    objOut = serializerInstance.serializeStream(diskBlockWriter)
    +    initialized = true
    +    this
    +  }
    +
    +  def close() {
    +    Utils.tryWithSafeFinally {
    +      if (initialized) {
    +        if (diskBlockWriter.syncWrites) {
    +          objOut.flush()
    +        }
    +      }
    +    } {
    +      if (initialized) {
    +        objOut.close()
    +      }
    +      diskBlockWriter = null
    +      objOut = null
    +      initialized = false
    +      hasBeenClosed = true
    +    }
    +  }
    +
    +  /**
    +   * Flush the partial writes and commit them as a single atomic block.
    +   */
    +  def commitAndClose(): Unit = {
    +    if (initialized) {
    +      objOut.flush()
    --- End diff --
    
    Update: actually, the old code called `objOut.close()` as part of `close()`, so we still need to figure out how to make sure that happens here :( 
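
    One possible shape, purely as a sketch (not what the patch currently does; whether closing `objOut` here also closes the wrapped byte-level stream, and whether `diskBlockWriter.commitAndClose()` tolerates that, depends on the serializer and writer implementations):

        def commitAndClose(): Unit = {
          if (initialized) {
            objOut.flush()
            // The pre-split writer closed its SerializationStream via close();
            // closing it here before handing off would preserve that behavior.
            objOut.close()
          }
          diskBlockWriter.commitAndClose()
          commitAndCloseHasBeenCalled = true
        }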




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195600102
  
    **[Test build #52947 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52947/consoleFull)** for PR 11498 at commit [`9fdde9f`](https://github.com/apache/spark/commit/9fdde9fe9f575699c2c63765618f13611b0e77cd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11498#discussion_r57097978
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/disk/DiskBlockObjectWriter.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.storage.disk
    +
    +import java.io.File
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.serializer.{SerializationStream, SerializerInstance}
    +import org.apache.spark.storage.BlockId
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A class for writing JVM objects directly to a file on disk. This class allows data to be appended
    + * to an existing block and can guarantee atomicity in the case of faults as it allows the caller to
    + * revert partial writes.
    + *
    + * This class does not support concurrent writes. Also, once the writer has been opened it cannot be
    + * reopened again.
    + */
    +private[spark] class DiskBlockObjectWriter(
    +    private[this] var diskBlockWriter: DiskBlockWriter,
    +    serializerInstance: SerializerInstance,
    +    val blockId: BlockId = null)
    +  extends Logging {
    +
    +  private var objOut: SerializationStream = null
    +  private var initialized = false
    +  private var hasBeenClosed = false
    +  private var commitAndCloseHasBeenCalled = false
    +
    +  def file: File = diskBlockWriter.file
    +
    +  def open(): DiskBlockObjectWriter = {
    +    if (hasBeenClosed) {
    +      throw new IllegalStateException("Writer already closed. Cannot be reopened.")
    +    }
    +    diskBlockWriter.open()
    +    objOut = serializerInstance.serializeStream(diskBlockWriter)
    +    initialized = true
    +    this
    +  }
    +
    +  def close() {
    +    Utils.tryWithSafeFinally {
    +      if (initialized) {
    +        if (diskBlockWriter.syncWrites) {
    +          objOut.flush()
    +        }
    +      }
    +    } {
    +      if (initialized) {
    +        objOut.close()
    +      }
    +      diskBlockWriter = null
    +      objOut = null
    +      initialized = false
    +      hasBeenClosed = true
    +    }
    +  }
    +
    +  /**
    +   * Flush the partial writes and commit them as a single atomic block.
    +   */
    +  def commitAndClose(): Unit = {
    +    if (initialized) {
    +      objOut.flush()
    --- End diff --
    
    Note that we don't explicitly close `objOut` here. Neither did the old code, which is why I preserved that behavior after the split.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195600654
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11498#discussion_r57045537
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java ---
    @@ -23,6 +23,7 @@
     import java.io.IOException;
     import javax.annotation.Nullable;
     
    +import org.apache.spark.storage.disk.DiskBlockObjectWriter;
    --- End diff --
    
    import order




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195544738
  
    **[Test build #52945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52945/consoleFull)** for PR 11498 at commit [`c04d3d7`](https://github.com/apache/spark/commit/c04d3d79dca195108aca5b5e59d9944e69313f37).




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195591764
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52944/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-200151982
  
    **[Test build #53867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53867/consoleFull)** for PR 11498 at commit [`0fee604`](https://github.com/apache/spark/commit/0fee6042a7cff69747f454056fd5cec6640dc381).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-199963977
  
    This LGTM overall; once you address these comments, I think it's good to go.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-200152102
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-191971867
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52405/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-191971576
  
    **[Test build #52405 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52405/consoleFull)** for PR 11498 at commit [`a33893d`](https://github.com/apache/spark/commit/a33893d9cf4b9996ce49d42cdf671a988e6e61ba).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class DiskBlockManagerSuite extends SparkFunSuite with BeforeAndAfterEach with BeforeAndAfterAll `
      * `class DiskBlockObjectWriterSuite extends SparkFunSuite with BeforeAndAfterEach `




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195543538
  
    **[Test build #52944 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52944/consoleFull)** for PR 11498 at commit [`8b8f4cb`](https://github.com/apache/spark/commit/8b8f4cb9771f93de70f3c56a3dff61bae32857fa).




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195549353
  
    **[Test build #52947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52947/consoleFull)** for PR 11498 at commit [`9fdde9f`](https://github.com/apache/spark/commit/9fdde9fe9f575699c2c63765618f13611b0e77cd).




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195600655
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52947/
    Test PASSed.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11498#discussion_r54933191
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/disk/DiskBlockObjectWriter.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.storage.disk
    +
    +import java.io.{File, OutputStream}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.serializer.{SerializationStream, SerializerInstance}
    +import org.apache.spark.storage.BlockId
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A class for writing JVM objects directly to a file on disk. This class allows data to be appended
    + * to an existing block and can guarantee atomicity in the case of faults as it allows the caller to
    + * revert partial writes.
    + *
    + * This class does not support concurrent writes. Also, once the writer has been opened it cannot be
    + * reopened again.
    + */
    +private[spark] class DiskBlockObjectWriter(
    +    diskBlockWriter: DiskBlockWriter,
    +    serializerInstance: SerializerInstance,
    +    val blockId: BlockId = null)
    +  extends Logging {
    +
    +  private var bs: OutputStream = null
    +  private var objOut: SerializationStream = null
    +  private var initialized = false
    +  private var hasBeenClosed = false
    +  private var commitAndCloseHasBeenCalled = false
    +
    +  def file: File = diskBlockWriter.file
    +
    +  def open(): DiskBlockObjectWriter = {
    +    if (hasBeenClosed) {
    +      throw new IllegalStateException("Writer already closed. Cannot be reopened.")
    +    }
    +    bs = diskBlockWriter.open()
    +    objOut = serializerInstance.serializeStream(bs)
    +    initialized = true
    +    this
    +  }
    +
    +  def close() {
    +    if (initialized) {
    +      Utils.tryWithSafeFinally {
    +        if (diskBlockWriter.syncWrites) {
    +          objOut.flush()
    +        }
    +      } {
    +        objOut.close()
    +      }
    +      bs = null
    +      objOut = null
    +      initialized = false
    +      hasBeenClosed = true
    +    }
    +  }
    +
    +  /**
    +   * Flush the partial writes and commit them as a single atomic block.
    +   */
    +  def commitAndClose(): Unit = {
    +    if (initialized) {
    +      objOut.flush()
    +    }
    +    diskBlockWriter.commitAndClose()
    +    commitAndCloseHasBeenCalled = true
    +  }
    +
    +
    +  /**
    +   * Reverts writes that haven't been flushed yet. Callers should invoke this function
    +   * when there are runtime exceptions. This method will not throw, though it may be
    +   * unsuccessful in truncating written data.
    +   *
    +   * @return the file that this DiskBlockObjectWriter wrote to.
    +   */
    +  def revertPartialWritesAndClose(): File = {
    +    if (initialized) {
    +      objOut.flush()
    +    }
    +    diskBlockWriter.revertPartialWritesAndClose()
    --- End diff --
    
    I'm not 100% confident that the wrapping / delegation is handled properly in all of these close methods, so I'm going to take a self-review pass to try to figure that out.




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195593458
  
    **[Test build #52945 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52945/consoleFull)** for PR 11498 at commit [`c04d3d7`](https://github.com/apache/spark/commit/c04d3d79dca195108aca5b5e59d9944e69313f37).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #11498: [SPARK-13653] Split disk writer into separate obj...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen closed the pull request at:

    https://github.com/apache/spark/pull/11498




[GitHub] spark pull request: [SPARK-13653] Split disk writer into separate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11498#issuecomment-195591761
  
    Merged build finished. Test FAILed.

