Posted to user@spark.apache.org by li...@itri.org.tw on 2019/03/08 08:55:32 UTC

A spark streaming problem about shuffle operation

Dear all,

I am using Spark Streaming (2.4.0) in my project.

After nine windows have been processed, the 10th window takes 2 seconds to process its data, as shown in Figure 1.

Drilling into the 2019/03/08 16:28:57 window shows two jobs, as in Figure 2 (each of the previous nine windows has only one job).

In Figure 4, there are 64 partitions and 12 tasks, yet no data goes through any shuffle operation.

I would like to know why the Spark application behaves this way in my project, since it costs extra time (0.4 s) to process data, as shown in Figure 3.
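For anyone who wants to reproduce the measurement, one way to pull per-job durations without clicking through the UI is the Spark monitoring REST API exposed on the driver's UI port. This is only a sketch; the host (ubuntu4) and default port 4040 are assumptions from the setup below, and the application id placeholder must be filled in from the first call:

```shell
# List running applications on the driver UI and note the application id.
curl http://ubuntu4:4040/api/v1/applications

# Fetch per-job records (submission time, duration, stages) for that application.
# Replace <app-id> with the id returned by the previous call.
curl http://ubuntu4:4040/api/v1/applications/<app-id>/jobs
```

Comparing the job list for a normal window against the 16:28:57 window should make the extra job and its 0.4 s cost visible in the JSON output.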

The running environment (on my own machines) is:
Spark Standalone mode
  Execution condition:
  Master/Driver node: ubuntu4
  Worker nodes: ubuntu6 (4 executors); ubuntu8 (4 executors); ubuntu9 (4 executors)
Number of executors: 12
spark.default.parallelism: 12
Number of partitions after repartitioning for the shuffle operation: 64
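For completeness, a minimal sketch of how an application could be submitted to match the setup above. The master URL port, class name, and jar name are hypothetical, not taken from the actual project; in standalone mode the 12 executors result from the cores granted across the three workers:

```shell
# Hypothetical submission matching the environment described above.
spark-submit \
  --master spark://ubuntu4:7077 \
  --total-executor-cores 12 \
  --conf spark.default.parallelism=12 \
  --class example.StreamingApp \
  streaming-app.jar
```

Inside the streaming code, the 64 partitions would come from something like DStream.repartition(64) before the shuffle stage.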

Any direction that helps us clarify this problem and avoid or reduce the overhead would be much appreciated.

Thanks.

Rick
[inline image]
                                                  Figure 1

[inline image]
                                                  Figure 2

[inline image]
                                                  Figure 3

[inline images]
                                                  Figure 4





--
This email may contain confidential information. Please do not use or disclose it in any way and delete it if you are not the intended recipient.