Posted to issues@spark.apache.org by "Jon Chase (JIRA)" <ji...@apache.org> on 2015/04/16 18:15:59 UTC

[jira] [Commented] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete

    [ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498232#comment-14498232 ] 

Jon Chase commented on SPARK-6962:
----------------------------------

I think it's different from SPARK-4395, as calling/omitting .cache() doesn't have any effect.  Also, once it hangs, I've never seen it finish (even after waiting many hours).  

Also different from SPARK-5060, I believe, as the web UI accurately reports the remaining tasks as unfinished (or in progress, in the case of the ones that were running when the hang occurred).  

Here's my original post from the email thread:

===========================================

Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB),
executor memory 20GB, driver memory 10GB
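
For reference, the shell launch looked roughly like this (the memory flags
are the ones above; --num-executors 10 is my assumption of one executor per
node, I haven't double-checked the exact value):

spark-shell --master yarn-client \
    --num-executors 10 \
    --executor-memory 20G \
    --driver-memory 10G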

I'm using Spark SQL, mainly via spark-shell, to query 15GB of data spread
out over roughly 2,000 Parquet files and my queries frequently hang. Simple
queries like "select count(*) from ..." on the entire data set work ok.
Slightly more demanding ones with group bys and some aggregate functions
(percentile_approx, avg, etc.) work ok as well, as long as I have some
criteria in my where clause to keep the number of rows down.
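
To give a sense of the query shapes (the path, table, and column names below
are made up for illustration, and percentile_approx assumes sqlContext is
actually a HiveContext):

val df = sqlContext.parquetFile("s3://my-bucket/events/")  // ~2,000 Parquet files
df.registerTempTable("events")

// fine on the whole data set
sqlContext.sql("SELECT count(*) FROM events").collect()

// also fine, as long as the WHERE clause keeps the row count down
sqlContext.sql("""
    SELECT user_id, avg(duration), percentile_approx(duration, 0.95)
    FROM events
    WHERE day = '2015-03-26'
    GROUP BY user_id
""").collect()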

Once I hit some limit on query complexity and rows processed, my queries
start to hang.  I've left them for up to an hour without seeing any
progress.  No OOMs either - the job is just stuck.

I've tried setting spark.sql.shuffle.partitions to 400 and even 800 (the
snippet after the log below shows how I set it), but with the same results:
usually near the end of the tasks (like 780 of 800 complete), progress just
stops:

15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 788.0 in
stage 1.0 (TID 1618) in 800 ms on
ip-10-209-22-211.eu-west-1.compute.internal (748/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 793.0 in
stage 1.0 (TID 1623) in 622 ms on
ip-10-105-12-41.eu-west-1.compute.internal (749/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 797.0 in
stage 1.0 (TID 1627) in 616 ms on ip-10-90-2-201.eu-west-1.compute.internal
(750/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 799.0 in
stage 1.0 (TID 1629) in 611 ms on ip-10-90-2-201.eu-west-1.compute.internal
(751/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 795.0 in
stage 1.0 (TID 1625) in 669 ms on
ip-10-105-12-41.eu-west-1.compute.internal (752/800)

^^^^^^^ this is where it stays forever
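
(For completeness, this is roughly how I bumped the partition count in the
shell; either form should have the same effect in 1.3:)

sqlContext.setConf("spark.sql.shuffle.partitions", "800")
// or, equivalently, via SQL:
sqlContext.sql("SET spark.sql.shuffle.partitions=800")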

Looking at the Spark UI, several of the executors still list active tasks.
I do see that the Shuffle Read for executors that don't have any tasks
remaining is around 100MB, whereas it's more like 10MB for the executors
that still have tasks.

The first stage, mapPartitions, always completes fine.  It's the second
stage (takeOrdered), that hangs.
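
Not sure if it's relevant, but the takeOrdered stage presumably comes from
the ORDER BY ... LIMIT part of my query, i.e. the SQL layer calling something
like RDD.takeOrdered under the hood. A trimmed-down query of that shape
(column names and the limit are invented):

sqlContext.sql("""
    SELECT user_id, avg(duration) AS avg_d
    FROM events
    GROUP BY user_id
    ORDER BY avg_d DESC
    LIMIT 100
""").collect()
// In the UI this should show up as roughly:
//   stage 1: mapPartitions  (scan + map side of the aggregation shuffle) - completes
//   stage 2: takeOrdered    (pulling the top rows back to the driver)    - the one that hangs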

I've had this issue in 1.2.0 and 1.2.1 as well as 1.3.0.  I've also
encountered it when using JSON files (instead of Parquet).
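
(For the JSON variant I loaded the data with the 1.3-era jsonFile call
instead of parquetFile - path made up again:)

val df = sqlContext.jsonFile("s3://my-bucket/events-json/")
df.registerTempTable("events")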


> Spark gets stuck on a step, hangs forever - jobs do not complete
> ----------------------------------------------------------------
>
>                 Key: SPARK-6962
>                 URL: https://issues.apache.org/jira/browse/SPARK-6962
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0
>            Reporter: Jon Chase
>
> Spark SQL queries (though this seems to be a Spark Core issue - I'm just using queries in the REPL to surface this, so I mention Spark SQL) hang indefinitely under certain (not totally understood) circumstances.  
> This is resolved by setting spark.shuffle.blockTransferService=nio, which seems to point to netty as the issue.  Netty was set as the default for the block transport layer in 1.2.0, which is when this issue started.  Setting the service to nio allows queries to complete normally (see the sketch below this description for one way to set it).
> I do not see this problem when running queries over smaller datasets (~20 files of ~5MB each).  When I increase the scope to include more data (several hundred ~5MB files), the queries will get through several steps but eventually hang indefinitely.
> Here's the email chain regarding this issue, including stack traces:
> http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/<CA...@mail.gmail.com>
> For context, here's the announcement regarding the block transfer service change: http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/<CA...@mail.gmail.com>
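
For anyone else hitting this: the workaround from the description can be
applied per job or globally (a sketch - the property name is as described
above, everything else is just one way of passing it):

spark-shell --master yarn-client --conf spark.shuffle.blockTransferService=nio

# or, to make it the default for all jobs, add a line to conf/spark-defaults.conf:
spark.shuffle.blockTransferService    nio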


