You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "JESSE CHEN (JIRA)" <ji...@apache.org> on 2016/04/01 18:43:25 UTC

[jira] [Closed] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

     [ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JESSE CHEN closed SPARK-14318.
------------------------------
    Resolution: Not A Problem

> TPCDS query 14 causes Spark SQL to hang
> ---------------------------------------
>
>                 Key: SPARK-14318
>                 URL: https://issues.apache.org/jira/browse/SPARK-14318
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0, 2.0.0
>            Reporter: JESSE CHEN
>              Labels: hangs
>         Attachments: threaddump-1459461915668.tdump
>
>
> TPCDS Q14 parses successfully, and plans created successfully. Spark tries to run (I used only 1GB text file), but "hangs". Tasks are extremely slow to process AND all CPUs are used 100% by the executor JVMs.
> It is very easy to reproduce:
> 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of 1GB text file (assuming you know how to generate the csv data). My command is like this:
> {noformat}
> /TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 --executor-memory 8g --num-executors 4 --executor-cores 4 --conf spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out
> {noformat}
> The Spark console output:
> {noformat}
> 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200)
> 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200)
> 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200)
> 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes)
> 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 on executor id: 2 hostname: bigaperf137.svl.ibm.com.
> 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200)
> {noformat}
> Notice that time durations between tasks are unusually long: 2~5 minutes.
> When looking at the Linux 'perf' tool, two top CPU consumers are:
>     86.48%        java  [unknown]   
>     12.41%        libjvm.so
> Using the Java hotspot profiling tools, I am able to show what hotspot methods are (top 5):
> {noformat}
> org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten()	46.845276	9,654,179 ms (46.8%)	9,654,179 ms	9,654,179 ms	9,654,179 ms
> org.apache.spark.unsafe.Platform.copyMemory()	18.631157	3,848,442 ms (18.6%)	3,848,442 ms	3,848,442 ms	3,848,442 ms
> org.apache.spark.util.collection.CompactBuffer.$plus$eq()	6.8570185	1,418,411 ms (6.9%)	1,418,411 ms	1,517,960 ms	1,517,960 ms
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue()	4.6126328	955,495 ms (4.6%)	955,495 ms	2,153,910 ms	2,153,910 ms
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()	4.581077	949,930 ms (4.6%)	949,930 ms	19,967,510 ms	19,967,510 ms
> {noformat}
> So as you can see, the test has been running for 1.5 hours...with 46% CPU spent in the 
> org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. 
> The stacks for top two are:
> {noformat}
> Marshalling	
> I
> java/io/DataOutputStream.writeInt() line 197
> org.​apache.​spark.​sql	
> I
> org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() line 60
> org.​apache.​spark.​storage	
> I
> org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
> org.​apache.​spark.​shuffle	
> I
> org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150
> org.​apache.​spark.​scheduler	
> I
> org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78
> I
> org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46
> I
> org/apache/spark/scheduler/Task.run() line 82
> org.​apache.​spark.​executor	
> I
> org/apache/spark/executor/Executor$TaskRunner.run() line 231
> Dispatching Overhead,​ Standard Library Worker Dispatching	
> I
> java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142
> I
> java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617
> I
> java/lang/Thread.run() line 745
> {noformat}
> and 
> {noformat}
> org.​apache.​spark.​unsafe	
> I
> org/apache/spark/unsafe/Platform.copyMemory() line 172
> org.​apache.​spark.​sql	
> I
> org/apache/spark/sql/catalyst/expressions/UnsafeRow.copy() line 504
> Class Loading	
> I
> org/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator.processNext()
> org.​apache.​spark.​sql	
> I
> org/apache/spark/sql/execution/BufferedRowIterator.hasNext() line 41
> I
> org/apache/spark/sql/execution/WholeStageCodegen$$anonfun$doExecute$2$$anon$2.hasNext() line 375
> scala.​collection	
> I
> scala/collection/Iterator$$anon$11.hasNext() line 369
> org.​apache.​spark.​shuffle	
> I
> org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 147
> org.​apache.​spark.​scheduler	
> I
> org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78
> I
> org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46
> I
> org/apache/spark/scheduler/Task.run() line 82
> org.​apache.​spark.​executor	
> I
> org/apache/spark/executor/Executor$TaskRunner.run() line 231
> Dispatching Overhead,​ Standard Library Worker Dispatching	
> I
> java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142
> I
> java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617
> I
> java/lang/Thread.run() line 745
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org