Posted to user@spark.apache.org by "TheGeorge1918 ." <zh...@gmail.com> on 2017/03/08 13:45:32 UTC

spark executor memory, jvm config

Hello all,

I was running a Spark job and some executors died without any error
information. The dead executors were replaced by newly requested ones, but
the Spark web UI showed no failures. Normally, if it were a memory issue, I
would find an OOM error there, but not this time.

Configuration:
1. each executor has 25G memory
2. garbage collection: G1GC

From the garbage collector output (please see below), it looks like the
executor still has free heap space, so I'm not sure what I should tune.

One thing I can try is to configure Metaspace,
e.g. -XX:MaxMetaspaceSize=256m. Any other suggestions? Thanks a lot.
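
For reference, a minimal sketch of how the settings described above could be
passed to the executors. It assumes the 25G figure maps to
spark.executor.memory and that the JVM flags go through
spark.executor.extraJavaOptions; the helper name to_submit_args is
hypothetical and only formats arguments, it does not launch anything.

```python
import shlex

# Assumption: the executor settings from this post expressed as Spark properties.
executor_conf = {
    "spark.executor.memory": "25g",
    # G1 collector plus the Metaspace cap proposed above.
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:MaxMetaspaceSize=256m",
}


def to_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(executor_conf)
# shlex.join quotes the value that contains a space, so the line is shell-safe.
print("spark-submit " + shlex.join(submit_args) + " ...")
```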

Best

Xuan

4731.701: [GC pause (G1 Evacuation Pause) (young), 0.0965042 secs]
   [Parallel Time: 85.5 ms, GC Workers: 28]
      [GC Worker Start (ms): Min: 4731702.0, Avg: 4731722.0, Max: 4731746.6, Diff: 44.6]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.2, Max: 1.9, Diff: 1.8, Sum: 6.4]
      [Update RS (ms): Min: 0.0, Avg: 22.1, Max: 40.8, Diff: 40.8, Sum: 617.7]
         [Processed Buffers: Min: 0, Avg: 58.5, Max: 195, Diff: 195, Sum: 1639]
      [Scan RS (ms): Min: 0.1, Avg: 1.2, Max: 1.9, Diff: 1.8, Sum: 34.1]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.3]
      [Object Copy (ms): Min: 36.4, Avg: 38.0, Max: 41.6, Diff: 5.1, Sum: 1065.0]
      [Termination (ms): Min: 0.0, Avg: 3.3, Max: 3.6, Diff: 3.6, Sum: 91.7]
         [Termination Attempts: Min: 1, Avg: 29.9, Max: 45, Diff: 44, Sum: 837]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.3, Max: 0.6, Diff: 0.5, Sum: 8.1]
      [GC Worker Total (ms): Min: 40.4, Avg: 65.1, Max: 85.4, Diff: 45.0, Sum: 1824.2]
      [GC Worker End (ms): Min: 4731786.9, Avg: 4731787.2, Max: 4731787.4, Diff: 0.5]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 1.5 ms]
   [Other: 9.4 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.0 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.6 ms]
      [Humongous Register: 0.6 ms]
      [Humongous Reclaim: 1.1 ms]
      [Free CSet: 5.0 ms]
   *[Eden: 8280.0M(1136.0M)->0.0B(14.9G) Survivors: 144.0M->64.0M Heap: 10.5G(25.0G)->2021.9M(25.0G)]*
 [Times: user=1.46 sys=0.01, real=0.09 secs]
Heap
 garbage-first heap   total 26214400K, used 11822248K [0x0000000180000000, 0x000000018040c800, 0x00000007c0000000)
  region size 4096K, 2325 young (9523200K), 16 survivors (65536K)
 Metaspace       used 76023K, capacity 77317K, committed 77616K, reserved 1118208K
  class space    used 8649K, capacity 8958K, committed 9008K, reserved 1048576K
 Concurrent marking:
      0   init marks: total time =     0.00 s (avg =     0.00 ms).
     10      remarks: total time =     0.70 s (avg =    70.35 ms).
           [std. dev =    51.84 ms, max =   159.84 ms]
        10  final marks: total time =     0.03 s (avg =     3.43 ms).
              [std. dev =     2.84 ms, max =     9.66 ms]
        10    weak refs: total time =     0.67 s (avg =    66.92 ms).
              [std. dev =    52.19 ms, max =   157.75 ms]
     10     cleanups: total time =     0.31 s (avg =    30.59 ms).
           [std. dev =    24.03 ms, max =    71.07 ms]
    Final counting total time =     0.05 s (avg =     5.13 ms).
    RS scrub total time =     0.18 s (avg =    18.46 ms).
  Total stop_world time =     1.01 s.
  Total concurrent time =    42.94 s (   42.06 s marking).
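
As a rough cross-check of the "still has heap space" reading, the figures in
the log above can be turned into occupancy percentages. This is only a sketch:
the summary string is copied from the Eden/Heap line (joined onto one line),
the Metaspace value is the "used" figure from the heap dump, and the helper
to_bytes is a hypothetical name.

```python
import re

# Eden/Heap summary line copied from the GC log above, joined onto one line.
summary = ("[Eden: 8280.0M(1136.0M)->0.0B(14.9G) Survivors: 144.0M->64.0M "
           "Heap: 10.5G(25.0G)->2021.9M(25.0G)]")

UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(text):
    """Convert a G1 size such as '10.5G' or '2021.9M' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


size = r"([\d.]+[BKMG])"
match = re.search(rf"Heap: {size}\({size}\)->{size}\({size}\)", summary)
used_before, cap_before, used_after, cap_after = (
    to_bytes(group) for group in match.groups()
)

pct_before = 100 * used_before / cap_before
pct_after = 100 * used_after / cap_after
print(f"heap before GC: {pct_before:.1f}% of capacity")
print(f"heap after GC:  {pct_after:.1f}% of capacity")

# Metaspace "used 76023K" from the heap dump, compared with the proposed 256m cap.
metaspace_used_mb = 76023 / 1024
print(f"metaspace used: {metaspace_used_mb:.1f} MB of a proposed 256 MB cap")
```

Both the heap and Metaspace are well below their limits here, which is
consistent with the absence of any OOM error.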

Re: spark executor memory, jvm config

Posted by "TheGeorge1918 ." <zh...@gmail.com>.
OK, I found the problem. There was a typo in my configuration, so dynamic
executor allocation was not actually disabled. As a result, executors were
killed and re-requested from time to time. All good now.
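
A small sketch of how this kind of mistake could be caught before submitting.
It assumes the typo was in a property name, which is not stated above; the
misspelled key below is purely hypothetical. The names in KNOWN_KEYS are real
Spark properties, while find_unknown_keys is an illustrative helper.

```python
import difflib

# Real Spark property names relevant to this thread (not an exhaustive list).
KNOWN_KEYS = [
    "spark.dynamicAllocation.enabled",
    "spark.executor.memory",
    "spark.executor.extraJavaOptions",
    "spark.executor.instances",
]


def find_unknown_keys(conf):
    """Map each key missing from KNOWN_KEYS to its closest known key, or None."""
    suggestions = {}
    for key in conf:
        if key not in KNOWN_KEYS:
            close = difflib.get_close_matches(key, KNOWN_KEYS, n=1)
            suggestions[key] = close[0] if close else None
    return suggestions


# Hypothetical submitted configuration; the first key is missing its final "d".
submitted = {
    "spark.dynamicAllocation.enable": "false",
    "spark.executor.memory": "25g",
}

for bad, guess in find_unknown_keys(submitted).items():
    print(f"unrecognised key {bad!r}; did you mean {guess!r}?")
```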