Posted to user@spark.apache.org by velvetbaldmime <ke...@gmail.com> on 2016/07/11 14:50:21 UTC

Spark hangs at "Removed broadcast_*"

Spark 2.0.0-preview

We've got an app that uses a fairly big broadcast variable. We run this on a
big EC2 instance, so deployment is in client mode. The broadcast variable is a
massive Map[String, Array[String]].
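
For context, the shape of the code being described is roughly the sketch below; the object and method names are made up, since the actual job isn't shown here.

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sf-counts"))

    // Large lookup table built on the driver and shared with every task.
    val bigMap: Map[String, Array[String]] = buildLookupTable()
    val bcast = sc.broadcast(bigMap)

    sc.textFile("input/path")
      .flatMap(line => bcast.value.getOrElse(line, Array.empty[String]))
      .saveAsTextFile("output/path")

    sc.stop()
  }

  // Stand-in for however the map is really produced.
  private def buildLookupTable(): Map[String, Array[String]] = Map.empty
}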

At the end of saveAsTextFile, the output in the folder seems to be complete
and correct (apart from the .crc files still being there), BUT the spark-submit
process then appears to get stuck removing the broadcast variable. The stuck
logs look like this: http://pastebin.com/wpTqvArY

My last run lasted for 12 hours after doing saveAsTextFile - just
sitting there. I did a jstack on the driver process; most threads are parked:
http://pastebin.com/E29JKVT7

Full story: we used this code with Spark 1.5.0 and it worked, but then the
data changed and something stopped fitting into Kryo's serialisation buffer.
Increasing it didn't help, so I had to disable the KryoSerialiser. Tested it
again - it hung. Switched to 2.0.0-preview - seems like the same issue.
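
For reference, the Kryo settings being talked about look roughly like this when building the SparkContext (the values are only illustrative). Kryo serialises each object into a single byte-array buffer, so spark.kryoserializer.buffer.max is capped a little below 2 GB, which is why one very large broadcast value can simply stop fitting.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("sf-counts")
  // Use Kryo (this is what was later disabled, falling back to the default JavaSerializer):
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // The buffer sizes that were being increased; the max must stay under ~2 GB:
  .set("spark.kryoserializer.buffer", "64m")
  .set("spark.kryoserializer.buffer.max", "1g")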

I'm not quite sure what's going on, given that there's almost no CPU
activity and no output in the logs, yet the output is not finalised as it
used to be.

Would appreciate any help, thanks



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hangs-at-Removed-broadcast-tp27320.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark hangs at "Removed broadcast_*"

Posted by Sea <26...@qq.com>.
Please provide your jstack info.




------------------ Original Message ------------------
From: "dhruve ashar" <dh...@gmail.com>
Sent: Wednesday, July 13, 2016, 3:53 AM
To: "Anton Sviridov" <ke...@gmail.com>
Cc: "user" <us...@spark.apache.org>
Subject: Re: Spark hangs at "Removed broadcast_*"




Re: Spark hangs at "Removed broadcast_*"

Posted by dhruve ashar <dh...@gmail.com>.
Looking at the jstack, it seems that it doesn't contain all the threads;
I cannot find the main thread in it.

I am not an expert at analyzing jstacks, but are you creating any threads
in your code? Are you shutting them down correctly?

This one is a non-daemon thread and doesn't seem to be coming from Spark:

"Scheduler-2144644334" #110 prio=5 os_prio=0 tid=0x00007f8104001800 nid=0x715 waiting on condition [0x00007f812cf95000]

Also, does the shutdown hook get called?
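
As a purely illustrative sketch (not taken from the app in question): a scheduler built with a custom, non-daemon thread factory would show up in a jstack under a name like the one above, and it keeps the driver JVM alive after main() returns unless it is shut down explicitly.

import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

object NonDaemonSchedulerSketch {
  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "Scheduler-" + r.hashCode) // custom thread name, similar in shape to the one above
        t.setDaemon(false) // non-daemon: the JVM cannot exit while this thread is alive
        t
      }
    })

    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = () // periodic work would go here
    }, 0L, 30L, TimeUnit.SECONDS)

    // ... the actual Spark job would run here ...

    // Without this line the process never exits: the non-daemon scheduler thread
    // keeps the JVM alive after main() returns, and shutdown hooks only fire once
    // a shutdown has actually been triggered (e.g. by Ctrl-C or SIGTERM).
    scheduler.shutdownNow()
  }
}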


On Tue, Jul 12, 2016 at 2:35 AM, Anton Sviridov <ke...@gmail.com> wrote:



-- 
-Dhruve Ashar

Re: Spark hangs at "Removed broadcast_*"

Posted by Anton Sviridov <ke...@gmail.com>.
Hi.

Here are the last few lines before it starts removing broadcasts:

16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003209_20886' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003209
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003209_20886: Committed
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3211.0 in stage 9.0 (TID 20888) in 95 ms on localhost (3209/3214)
16/07/11 14:02:11 INFO Executor: Finished task 3209.0 in stage 9.0 (TID 20886). 1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3209.0 in stage 9.0 (TID 20886) in 103 ms on localhost (3210/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003208_20885' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003208
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003208_20885: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3208.0 in stage 9.0 (TID 20885). 1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3208.0 in stage 9.0 (TID 20885) in 109 ms on localhost (3211/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003212_20889' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003212
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003212_20889: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3212.0 in stage 9.0 (TID 20889). 1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3212.0 in stage 9.0 (TID 20889) in 84 ms on localhost (3212/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003210_20887' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003210
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003210_20887: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3210.0 in stage 9.0 (TID 20887). 1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3210.0 in stage 9.0 (TID 20887) in 100 ms on localhost (3213/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003213_20890' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003213
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003213_20890: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3213.0 in stage 9.0 (TID 20890). 1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3213.0 in stage 9.0 (TID 20890) in 82 ms on localhost (3214/3214)
16/07/11 14:02:11 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
16/07/11 14:02:11 INFO DAGScheduler: ResultStage 9 (saveAsTextFile at SfCountsDumper.scala:13) finished in 42.294 s
16/07/11 14:02:11 INFO DAGScheduler: Job 1 finished: saveAsTextFile at SfCountsDumper.scala:13, took 9517.124624 s
16/07/11 14:28:46 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.101.230.154:35192 in memory (size: 15.8 KB, free: 37.1 GB)
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 7
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 6
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 5
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 4
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 3
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 2
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 1
16/07/11 14:28:46 INFO BlockManager: Removing RDD 14
16/07/11 14:28:46 INFO ContextCleaner: Cleaned RDD 14
16/07/11 14:28:46 INFO BlockManagerInfo: Removed broadcast_11_piece0 on 10.101.230.154:35192 in memory (size: 25.5 KB, free: 37.1 GB)
...

In fact, the job is still running: Spark's UI shows an uptime of 20.6 hours,
with the last job having finished at least 18 hours ago.
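
For reference (and not a confirmed fix for this particular hang): the driver JVM cannot exit while any non-daemon thread is still alive, and spark-submit in client mode only returns once the driver process itself exits, so explicitly releasing the broadcast and stopping the context at the end of the program is one way to narrow things down. The variable names below are hypothetical.

// Hypothetical end-of-driver cleanup; counts, bcast and sc stand for the job's own values.
counts.saveAsTextFile("stats/sf_counts")

bcast.unpersist(blocking = true) // drop the broadcast blocks from executor memory
bcast.destroy()                  // release it on the driver as well
sc.stop()                        // stop the SparkContext explicitly rather than relying on the shutdown hook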

On Mon, 11 Jul 2016 at 23:23 dhruve ashar <dh...@gmail.com> wrote:


Re: Spark hangs at "Removed broadcast_*"

Posted by dhruve ashar <dh...@gmail.com>.
Hi,

Can you check from the logs when the job actually finished? The
logs provided are too short and do not reveal meaningful information.



On Mon, Jul 11, 2016 at 9:50 AM, velvetbaldmime <ke...@gmail.com> wrote:



-- 
-Dhruve Ashar