You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viper-kun <gi...@git.apache.org> on 2016/03/22 13:51:08 UTC

[GitHub] spark pull request: Increase probability of using cached serialize...

GitHub user viper-kun opened a pull request:

    https://github.com/apache/spark/pull/11886

    Increase probability of using cached serialized status

    Increase probability of using cached serialized status
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viper-kun/spark patch-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11886.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11886
    
----
commit 32ad0d57becd216f1ad9e835ee6e4db527b3ab20
Author: xukun <xu...@huawei.com>
Date:   2016-03-22T12:44:34Z

    Increase probability of using cached serialized status

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11886#discussion_r56989619
  
    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -443,13 +443,12 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
               statuses = mapStatuses.getOrElse(shuffleId, Array[MapStatus]())
               epochGotten = epoch
           }
    -    }
    -    // If we got here, we failed to find the serialized locations in the cache, so we pulled
    -    // out a snapshot of the locations as "statuses"; let's serialize and return that
    -    val bytes = MapOutputTracker.serializeMapStatuses(statuses)
    -    logInfo("Size of output statuses for shuffle %d is %d bytes".format(shuffleId, bytes.length))
    -    // Add them into the table only if the epoch hasn't changed while we were working
    -    epochLock.synchronized {
    +      
    +      // If we got here, we failed to find the serialized locations in the cache, so we pulled
    --- End diff --
    
    Are you sure this is related to your problem? you've put the serialization into code holding the lock now, which has a downside. I'm not clear from your description what the problem is. What do you mean by "serialize Mapstatus is in serial model from different shuffle stage."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-211886837
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-213825430
  
    **[Test build #56806 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56806/consoleFull)** for PR 11886 at commit [`8f8133c`](https://github.com/apache/spark/commit/8f8133c52dfdbfdd1176f9e2764d3b9e3f921e70).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-217152303
  
    Note  this is fixed by https://github.com/apache/spark/pull/12113 which does a lot more for the map output statuses, it would be great to get that in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-213825500
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-211886526
  
    **[Test build #56222 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56222/consoleFull)** for PR 11886 at commit [`3b2e8e9`](https://github.com/apache/spark/commit/3b2e8e9853c1ba9be5010aff368e39fe67727bbe).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-216026123
  
    @viper-kun what do you think? I'm not as sure of the impact if this now prevents two cached statuses from being read at once.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-199797291
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-216701518
  
    The high level fix makes sense. I think the right approach here would be to have a read write lock, where all reads are concurrent but the writes block everything. However, that's probably overkill here so I'm OK with the code as is.
    
    Are there any potential sources of regressions? Now we hold a lock while serializing, which could take a while.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by viper-kun <gi...@git.apache.org>.
Github user viper-kun commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-218140398
  
    @tgravescs  ok, i close it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by viper-kun <gi...@git.apache.org>.
Github user viper-kun closed the pull request at:

    https://github.com/apache/spark/pull/11886


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11886#discussion_r60841016
  
    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -431,6 +431,7 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
       def getSerializedMapOutputStatuses(shuffleId: Int): Array[Byte] = {
         var statuses: Array[MapStatus] = null
         var epochGotten: Long = -1
    +    var byteArr: Array[Byte] = null
    --- End diff --
    
    I think this is OK, but this can still go in the synchronized block and a few lines down, to keep the scope narrower. I think it's fairly OK as it though.
    
    This certainly solves the problem you identified. It does mean more work is done while holding the lock, so that no other threads can retrieve cached statuses while this is happening. I assume that's why it was written that way to begin with, even if it causes the problem you identify.
    
    I wonder if there is any decent way to rewrite this to manage that better? it would require some object to synchronize on per serialized status so that only one thread can retrieve it. Easily solved with a bit of redirection, like holding a simple AtomicReference to the byte array and locking on the former.
    
    Worth it or over-thinking it? I wouldn't mind hearing a though from @andrewor14 or @vanzin if they have instincts on this one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-213800108
  
    Jenkins add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11886#discussion_r62184963
  
    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -442,19 +443,19 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
             case None =>
               statuses = mapStatuses.getOrElse(shuffleId, Array[MapStatus]())
               epochGotten = epoch
    +          // If we got here, we failed to find the serialized locations in the cache, so we pulled
    +          // out a snapshot of the locations as "statuses"; let's serialize and return that
    +          byteArr = MapOutputTracker.serializeMapStatuses(statuses)
    +          logInfo("Size of output statuses for shuffle %d is %d bytes"
    +            .format(shuffleId, byteArr.length))
    --- End diff --
    
    This actually is not ok because the synchronized block is going to block the dispatcher threads which could then cause heartbeats and other messages to not be processed.  For small things its fine but once you get to larger ones you would have issues.
    
    See PR https://github.com/apache/spark/pull/12113 which fixes issues with large map output statuses and I believe fixes this same issue because only one thread will serialize and the rest will use the cached version..  Note that there are still improvements we could make with regards to caching. We could be more proactive to cache things up front, but since the cached statuses are all cleared when something changes you have to know which ones to cache again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11886#discussion_r61975476
  
    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -431,6 +431,7 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
       def getSerializedMapOutputStatuses(shuffleId: Int): Array[Byte] = {
         var statuses: Array[MapStatus] = null
         var epochGotten: Long = -1
    +    var byteArr: Array[Byte] = null
    --- End diff --
    
    I left a comment on the main thread


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by viper-kun <gi...@git.apache.org>.
Github user viper-kun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11886#discussion_r57102821
  
    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -443,13 +443,12 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
               statuses = mapStatuses.getOrElse(shuffleId, Array[MapStatus]())
               epochGotten = epoch
           }
    -    }
    -    // If we got here, we failed to find the serialized locations in the cache, so we pulled
    -    // out a snapshot of the locations as "statuses"; let's serialize and return that
    -    val bytes = MapOutputTracker.serializeMapStatuses(statuses)
    -    logInfo("Size of output statuses for shuffle %d is %d bytes".format(shuffleId, bytes.length))
    -    // Add them into the table only if the epoch hasn't changed while we were working
    -    epochLock.synchronized {
    +      
    +      // If we got here, we failed to find the serialized locations in the cache, so we pulled
    --- End diff --
    
    The problem description:
    Execute the query listed in jira(https://issues.apache.org/jira/browse/SPARK-14065), all tasks running on some executor are slow.
    
    Slow executor logs show that RPC(GetMapOutputStatuses) get  RpcTimeoutException
    ```
    Error sending message [message = GetMapOutputStatuses(1)] in 1 attempts | org.apache.spark.Logging$class.logWarning(Logging.scala:92)
    org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.network.timeout
    ```
    
    Driver logs shows that serialize mapstatus is slow
    ```
    16/03/22 11:47:07 INFO [dispatcher-event-loop-36] MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 10298063 bytes
    16/03/22 11:47:14 INFO [dispatcher-event-loop-30] MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 10298063 bytes
    16/03/22 11:47:21 INFO [dispatcher-event-loop-3] MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 10298063 bytes
    16/03/22 11:47:27 INFO [dispatcher-event-loop-32] MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 10298063 bytes
    16/03/22 11:47:34 INFO [dispatcher-event-loop-31] MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 10298063 bytes
    16/03/22 11:47:41 INFO [dispatcher-event-loop-38] MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 10298063 bytes
    16/03/22 11:47:47 INFO [dispatcher-event-loop-4] MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 10298063 bytes
    16/03/22 11:47:54 INFO [dispatcher-event-loop-37] MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 10298063 bytes
    16/03/22 11:48:00 INFO [dispatcher-event-loop-28] MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 10298063 bytes
    ```
    
    When reduce task start, all executors will get mapstatus from driver.Because "MapOutputTracker.serializeMapStatuses(statuses)" is out of epochLock.synchronized {}
    ,all thread will do the operation - MapOutputTracker.serializeMapStatuses(statuses). In function serializeMapStatuses,  it has sync on statuses. So Serialize is one by one. 
    Every serialize cost 7 seconds. We have 80 executors, it total cost 560 seconds. The result is  some executor get mapstatus timeout.
    This patch put "MapOutputTracker.serializeMapStatuses(statuses)" into epochLock.synchronized {}, it will increase probability of using cached serialized status.
    
    I have test  listed sql in 30T tpcds. The result shows it  faster than old. 
    ![image](https://cloud.githubusercontent.com/assets/6460155/13974069/9f448cc4-f0e3-11e5-9a5d-bc3cd8300db1.png)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by viper-kun <gi...@git.apache.org>.
Github user viper-kun commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-204319818
  
    @srowen  The logic is the same to me.If necessary, I will put  serializeMapStatus into  match...None.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-208322056
  
    Ping @viper-kun 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-213800102
  
    Jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-211886833
  
    **[Test build #56222 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56222/consoleFull)** for PR 11886 at commit [`3b2e8e9`](https://github.com/apache/spark/commit/3b2e8e9853c1ba9be5010aff368e39fe67727bbe).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-211884902
  
    Jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-206391809
  
    I personally would move this into the block above. The logic seems to be more obvious that way, since the logic to fill the missing cache entry is then in the branch where there is nothing in the cache. It should be equivalent logically.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11886#discussion_r57118942
  
    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -443,13 +443,12 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
               statuses = mapStatuses.getOrElse(shuffleId, Array[MapStatus]())
               epochGotten = epoch
           }
    -    }
    -    // If we got here, we failed to find the serialized locations in the cache, so we pulled
    -    // out a snapshot of the locations as "statuses"; let's serialize and return that
    -    val bytes = MapOutputTracker.serializeMapStatuses(statuses)
    -    logInfo("Size of output statuses for shuffle %d is %d bytes".format(shuffleId, bytes.length))
    -    // Add them into the table only if the epoch hasn't changed while we were working
    -    epochLock.synchronized {
    +      
    +      // If we got here, we failed to find the serialized locations in the cache, so we pulled
    --- End diff --
    
    OK, to paraphrase: everyone finds the cache is empty one by one very quickly, and everyone proceeds to load the cached value. Isn't it simpler to put this load in the `cachedSerializedStatuses.get(shuffleId) match ... None  =>` block then?
    
    This code is pretty old but CC @mateiz or @JoshRosen in case they have an opinion since they touched it last


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-211886838
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56222/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-213801214
  
    **[Test build #56806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56806/consoleFull)** for PR 11886 at commit [`8f8133c`](https://github.com/apache/spark/commit/8f8133c52dfdbfdd1176f9e2764d3b9e3f921e70).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14065]Increase probability of using cac...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11886#issuecomment-213825501
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56806/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org