You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2016/01/14 02:33:53 UTC

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/10748

    [SPARK-12817] Simplify CacheManager code and remove unused BlockManager methods

    CacheManager directly calls MemoryStore.unrollSafely() and has its own logic for handling graceful fallback to disk when cached data does not fit in memory. However, this logic also exists inside of the MemoryStore itself, so this appears to be unnecessary duplication.
    
    We can remove this duplication and delete a significant amount of BlockManager code which existed only to support this CacheManager code.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark block-manager-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10748.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10748
    
----
commit 302eaa52fc98c79661cb2b17d9e95392a98b18f0
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-01-14T01:02:29Z

    Remove putArray() and duplicated unroll calls.

commit fdad4126cd0691bb70f224b7a55dc93e471dc0b9
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-01-14T01:07:02Z

    Inline MemoryStore.putArray() at callsite.

commit 5accde7d27cba46bd274b92db62981398b02608c
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-01-14T01:08:58Z

    Cleanup post-inlining.

commit 80d375a8962cb79a3a6da1f176f82c709e6a493c
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-01-14T01:10:31Z

    Remove one of the tryToPut() overloads.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171499466
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171503651
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49364/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171737226
  
    Hmm, it looks like two tests are failing:
    
    ```
    [info] - compute without caching when no partitions fit in memory *** FAILED *** (3 seconds, 905 milliseconds)
    [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, localhost): org.apache.spark.storage.BlockException: Block manager failed to return cached value for rdd_0_0!
    ```
    
    and
    
    ```
    [info] - compute when only some partitions fit in memory *** FAILED *** (3 seconds, 893 milliseconds)
    [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 15, localhost): org.apache.spark.storage.BlockException: Block manager failed to return cached value for rdd_0_5!
    [info] 	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:86)
    ```
    
    My hunch is that the spill-to-disk fallback path works slightly differently when called in `BlockManager.putIterator -> MemoryStore.putIterator` vs. the old code which interacted directly with the memory manager and other block manager components.
    
    I'll take a look to see if I can spot what's going on. I think that the control flow of the disk fallback path could be a bit better documented, so I'll see about adding some comments to the existing code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171525760
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171503645
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10748#discussion_r49678000
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala ---
    @@ -58,14 +57,6 @@ private[spark] class DiskStore(blockManager: BlockManager, diskManager: DiskBloc
         PutResult(bytes.limit(), Right(bytes.duplicate()))
       }
     
    -  override def putArray(
    -      blockId: BlockId,
    -      values: Array[Any],
    -      level: StorageLevel,
    -      returnValues: Boolean): PutResult = {
    -    putIterator(blockId, values.toIterator, level, returnValues)
    --- End diff --
    
    The fact that these `putArray()` methods just turned around and called `putIterator()` suggests to me that this isn't a terribly useful method to expose.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10748#discussion_r49678132
  
    --- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
    @@ -126,67 +136,4 @@ private[spark] class CacheManager(blockManager: BlockManager) extends Logging {
           }
         }
       }
    -
    -  /**
    -   * Cache the values of a partition, keeping track of any updates in the storage statuses of
    -   * other blocks along the way.
    -   *
    -   * The effective storage level refers to the level that actually specifies BlockManager put
    -   * behavior, not the level originally specified by the user. This is mainly for forcing a
    -   * MEMORY_AND_DISK partition to disk if there is not enough room to unroll the partition,
    -   * while preserving the the original semantics of the RDD as specified by the application.
    -   */
    -  private def putInBlockManager[T](
    --- End diff --
    
    The idea behind deleting this block of code is the fact that `MemoryStore.putIterator` seems to implement essentially the same logic, using `unrollSafely` and falling back to disk when necessary.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171512719
  
    **[Test build #49370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49370/consoleFull)** for PR 10748 at commit [`80d375a`](https://github.com/apache/spark/commit/80d375a8962cb79a3a6da1f176f82c709e6a493c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen closed the pull request at:

    https://github.com/apache/spark/pull/10748


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10748#discussion_r49678224
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
    @@ -333,15 +322,6 @@ private[spark] class MemoryStore(blockManager: BlockManager, memoryManager: Memo
         blockId.asRDDId.map(_.rddId)
       }
     
    -  private def tryToPut(
    --- End diff --
    
    I dropped this overload so that it's easier to grep around the code in IntelliJ.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171525762
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49370/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171499467
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49361/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171525696
  
    **[Test build #49370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49370/consoleFull)** for PR 10748 at commit [`80d375a`](https://github.com/apache/spark/commit/80d375a8962cb79a3a6da1f176f82c709e6a493c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-174377653
  
    Going to close this for now to declutter the queue but will re-open as soon as my other patch is merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10748#discussion_r49678256
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
    @@ -170,9 +153,15 @@ private[spark] class MemoryStore(blockManager: BlockManager, memoryManager: Memo
         unrolledValues match {
           case Left(arrayValues) =>
             // Values are fully unrolled in memory, so store them as an array
    -        val res = putArray(blockId, arrayValues, level, returnValues)
    -        droppedBlocks ++= res.droppedBlocks
    -        PutResult(res.size, res.data, droppedBlocks)
    +        if (level.deserialized) {
    --- End diff --
    
    This is an inlining of the previous `putArray` code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171510426
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171494895
  
    /cc @andrewor14 for review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10748#discussion_r49678291
  
    --- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
    @@ -75,7 +75,17 @@ private[spark] class CacheManager(blockManager: BlockManager) extends Logging {
     
               // Otherwise, cache the values and keep track of any updates in block statuses
               val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
    -          val cachedValues = putInBlockManager(key, computedValues, storageLevel, updatedBlocks)
    +          val cachedValues = {
    --- End diff --
    
    This code is moved from the old `putInBlockManager` method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10748#discussion_r49678166
  
    --- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
    @@ -126,67 +136,4 @@ private[spark] class CacheManager(blockManager: BlockManager) extends Logging {
           }
         }
       }
    -
    -  /**
    -   * Cache the values of a partition, keeping track of any updates in the storage statuses of
    -   * other blocks along the way.
    -   *
    -   * The effective storage level refers to the level that actually specifies BlockManager put
    -   * behavior, not the level originally specified by the user. This is mainly for forcing a
    -   * MEMORY_AND_DISK partition to disk if there is not enough room to unroll the partition,
    -   * while preserving the the original semantics of the RDD as specified by the application.
    -   */
    -  private def putInBlockManager[T](
    -      key: BlockId,
    -      values: Iterator[T],
    -      level: StorageLevel,
    -      updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)],
    -      effectiveStorageLevel: Option[StorageLevel] = None): Iterator[T] = {
    -
    -    val putLevel = effectiveStorageLevel.getOrElse(level)
    -    if (!putLevel.useMemory) {
    -      /*
    -       * This RDD is not to be cached in memory, so we can just pass the computed values as an
    -       * iterator directly to the BlockManager rather than first fully unrolling it in memory.
    -       */
    -      updatedBlocks ++=
    -        blockManager.putIterator(key, values, level, tellMaster = true, effectiveStorageLevel)
    -      blockManager.get(key) match {
    -        case Some(v) => v.data.asInstanceOf[Iterator[T]]
    -        case None =>
    -          logInfo(s"Failure to store $key")
    -          throw new BlockException(key, s"Block manager failed to return cached value for $key!")
    -      }
    -    } else {
    -      /*
    -       * This RDD is to be cached in memory. In this case we cannot pass the computed values
    -       * to the BlockManager as an iterator and expect to read it back later. This is because
    -       * we may end up dropping a partition from memory store before getting it back.
    -       *
    -       * In addition, we must be careful to not unroll the entire partition in memory at once.
    -       * Otherwise, we may cause an OOM exception if the JVM does not have enough space for this
    -       * single partition. Instead, we unroll the values cautiously, potentially aborting and
    -       * dropping the partition to disk if applicable.
    -       */
    -      blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
    -        case Left(arr) =>
    -          // We have successfully unrolled the entire partition, so cache it in memory
    -          updatedBlocks ++=
    -            blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
    --- End diff --
    
    The `BlockManager.putArray()` method was only called from here, hence the cleanup of those methods.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/10748#issuecomment-171499501
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12817] Simplify CacheManager code and r...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10748#discussion_r49772294
  
    --- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
    @@ -126,67 +136,4 @@ private[spark] class CacheManager(blockManager: BlockManager) extends Logging {
           }
         }
       }
    -
    -  /**
    -   * Cache the values of a partition, keeping track of any updates in the storage statuses of
    -   * other blocks along the way.
    -   *
    -   * The effective storage level refers to the level that actually specifies BlockManager put
    -   * behavior, not the level originally specified by the user. This is mainly for forcing a
    -   * MEMORY_AND_DISK partition to disk if there is not enough room to unroll the partition,
    -   * while preserving the the original semantics of the RDD as specified by the application.
    -   */
    -  private def putInBlockManager[T](
    -      key: BlockId,
    -      values: Iterator[T],
    -      level: StorageLevel,
    -      updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)],
    -      effectiveStorageLevel: Option[StorageLevel] = None): Iterator[T] = {
    -
    -    val putLevel = effectiveStorageLevel.getOrElse(level)
    -    if (!putLevel.useMemory) {
    -      /*
    -       * This RDD is not to be cached in memory, so we can just pass the computed values as an
    -       * iterator directly to the BlockManager rather than first fully unrolling it in memory.
    -       */
    -      updatedBlocks ++=
    -        blockManager.putIterator(key, values, level, tellMaster = true, effectiveStorageLevel)
    -      blockManager.get(key) match {
    -        case Some(v) => v.data.asInstanceOf[Iterator[T]]
    -        case None =>
    -          logInfo(s"Failure to store $key")
    -          throw new BlockException(key, s"Block manager failed to return cached value for $key!")
    -      }
    -    } else {
    -      /*
    -       * This RDD is to be cached in memory. In this case we cannot pass the computed values
    -       * to the BlockManager as an iterator and expect to read it back later. This is because
    -       * we may end up dropping a partition from memory store before getting it back.
    --- End diff --
    
    This problem can be addressed via my other patch for locking in the block manager: we can have a put() implicitly retain a lock to the block which was just stored.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org