Posted to issues@spark.apache.org by "Malte Buecken (JIRA)" <ji...@apache.org> on 2014/10/15 23:15:34 UTC

[jira] [Commented] (SPARK-2681) Spark can hang when fetching shuffle blocks

    [ https://issues.apache.org/jira/browse/SPARK-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172959#comment-14172959 ] 

Malte Buecken commented on SPARK-2681:
--------------------------------------

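I am still seeing this issue in 1.1.0. The log excerpt below shows no output between 11:23:43 and 11:54:52 (about 31 minutes), after which the connections are removed: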

14/10/15 11:20:27 INFO MapOutputTrackerWorker: Got the output locations
14/10/15 11:20:48 INFO ConnectionManager: Accepted connection from [host:59365]
14/10/15 11:20:48 INFO SendingConnection: Initiating connection to [host:44256]
14/10/15 11:20:48 INFO SendingConnection: Connected to [host:44256], 1 messages pending
14/10/15 11:23:43 INFO ConnectionManager: Accepted connection from [host:50444]
14/10/15 11:23:43 INFO SendingConnection: Initiating connection to [host:55769]
14/10/15 11:23:43 INFO SendingConnection: Connected to [host5:55769], 1 messages pending
14/10/15 11:54:52 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@e2e30ea
14/10/15 11:54:52 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(host,55769)
14/10/15 11:54:52 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(host,55769)
14/10/15 11:54:52 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(host,55769)
14/10/15 11:54:52 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(host,55769)
14/10/15 11:54:52 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(host,55769) not found
14/10/15 11:54:52 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(host,55769) not found
14/10/15 11:54:52 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@e2e30ea
java.nio.channels.CancelledKeyException
	at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
	at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/10/15 11:54:52 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@1968265a
14/10/15 11:54:52 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@1968265a
java.nio.channels.CancelledKeyException
	at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
	at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/10/15 11:55:08 INFO BlockManager: Removing broadcast 95
14/10/15 11:55:32 INFO BlockManager: Removing broadcast 96
14/10/15 11:55:56 INFO BlockManager: Removing broadcast 98

> Spark can hang when fetching shuffle blocks
> -------------------------------------------
>
>                 Key: SPARK-2681
>                 URL: https://issues.apache.org/jira/browse/SPARK-2681
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Guoqiang Li
>            Priority: Blocker
>         Attachments: jstack-26027.log
>
>
> executor log :
> {noformat}
> 14/07/24 22:56:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 53628
> 14/07/24 22:56:52 INFO executor.Executor: Running task ID 53628
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
> 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Updating epoch to 236 and clearing cache
> 14/07/24 22:56:52 INFO spark.CacheManager: Partition rdd_51_83 not found, computing it
> 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them
> 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
> 14/07/24 22:56:53 INFO spark.MapOutputTrackerWorker: Got the output locations
> 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
> 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 non-empty blocks out of 1024 blocks
> 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote fetches in 8 ms
> 14/07/24 22:56:55 INFO storage.MemoryStore: ensureFreeSpace(28728) called with curMem=920109320, maxMem=4322230272
> 14/07/24 22:56:55 INFO storage.MemoryStore: Block rdd_51_83 stored as values to memory (estimated size 28.1 KB, free 3.2 GB)
> 14/07/24 22:56:55 INFO storage.BlockManagerMaster: Updated info of block rdd_51_83
> 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_189_83 not found, computing it
> 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 28, fetching them
> 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
> 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Got the output locations
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1024 blocks
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote fetches in 0 ms
> 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_50_83 not found, computing it
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 non-empty blocks out of 1024 blocks
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote fetches in 4 ms
> 14/07/24 22:57:09 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(tuan221,51153)
> 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(tuan221,51153)
> 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(tuan221,51153)
> 14/07/24 23:05:07 INFO network.ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@3dcc1da1
> 14/07/24 23:05:07 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(tuan211,43828)
> 14/07/24 23:05:07 INFO network.ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@3dcc1da1
> java.nio.channels.CancelledKeyException
> 	at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
> 	at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
> 14/07/24 23:05:07 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(tuan211,43828)
> 14/07/24 23:05:07 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
> {noformat}


