Posted to user@spark.apache.org by Mohit Jaggi <mo...@gmail.com> on 2015/07/24 00:03:40 UTC

spark dataframe gc

Hi There,
I am testing Spark DataFrame and haven't been able to get my code to finish
due to what I suspect are GC issues. My guess is that GC pauses interfere
with heartbeating, so executors are detected as failed. The data is ~50
numeric columns and ~100 million rows in a CSV file.
We are doing a groupBy on one of the columns and calculating the average of
each of the other columns. The groupBy key has about 250k unique values.
It seems that Spark is creating a lot of temporary objects (see jmap output
below) while calculating the averages, which surprises me. Why doesn't it
reuse the same temporary variable? Am I missing something? Do I need to set
a config flag to enable code generation and avoid this?
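For reference, the job is essentially the following sketch (the column name
"key" and the CSV path are placeholders, and CSV loading assumes the
spark-csv package, since Spark 1.4 has no built-in CSV reader). In Spark 1.4
the spark.sql.codegen flag controls whether Catalyst compiles expressions
rather than interpreting them, and it must be set before the query runs:

```scala
import org.apache.spark.sql.functions.avg

// Expression code generation is off by default in Spark 1.4.
sqlContext.setConf("spark.sql.codegen", "true")

// Placeholder path and options; requires the spark-csv package in 1.4.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")

// Average every non-key column per key (~250k distinct keys).
val aggExprs = df.columns.filter(_ != "key").map(c => avg(c))
val result = df.groupBy("key").agg(aggExprs.head, aggExprs.tail: _*)
```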


Mohit.

[xxxxx app-20150723142604-0002]$ jmap -histo 12209


 num     #instances         #bytes  class name
----------------------------------------------
   1:     258615458     8275694656  scala.collection.immutable.$colon$colon
   2:     103435856     7447381632  org.apache.spark.sql.catalyst.expressions.Cast
   3:     103435856     4964921088  org.apache.spark.sql.catalyst.expressions.Coalesce
   4:       1158643     4257400112  [B
   5:      51717929     4137434320  org.apache.spark.sql.catalyst.expressions.SumFunction
   6:      51717928     3723690816  org.apache.spark.sql.catalyst.expressions.Add
   7:      51717929     2896204024  org.apache.spark.sql.catalyst.expressions.CountFunction
   8:      51717928     2896203968  org.apache.spark.sql.catalyst.expressions.MutableLiteral
   9:      51717928     2482460544  org.apache.spark.sql.catalyst.expressions.Literal
  10:      51803728     1243289472  java.lang.Double
  11:      51717755     1241226120  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5
  12:        975810      850906320  [Lorg.apache.spark.sql.catalyst.expressions.AggregateFunction;
  13:      51717754      827484064  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$org$apache$spark$sql$catalyst$expressions$Cast$$cast$1
  14:        982451       47157648  java.util.HashMap$Entry
  15:        981132       34981720  [Ljava.lang.Object;
  16:       1049984       25199616  org.apache.spark.sql.types.UTF8String
  17:        978296       23479104  org.apache.spark.sql.catalyst.expressions.GenericRow
  18:        117166       15944560  <methodKlass>
  19:        117166       14986224  <constMethodKlass>
  20:          1567       12891952  [Ljava.util.HashMap$Entry;
  21:          9103       10249728  <constantPoolKlass>
  22:          9103        9278592  <instanceKlassKlass>
  23:          5072        5691320  [I
  24:          7281        5335040  <constantPoolCacheKlass>
  25:         46420        4769600  [C
  26:        105984        3391488  io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
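The matching SumFunction/CountFunction counts in the histogram are
consistent with interpreted aggregation in Spark 1.4: each avg is evaluated
as a running sum and count per group. A minimal Python sketch of that
semantics (plain dicts standing in for Spark's per-group aggregate buffers;
all names are illustrative, not Spark APIs):

```python
from collections import defaultdict

def group_averages(rows, key_index):
    """Average every non-key column per group, kept as sum/count pairs."""
    sums = {}                  # group key -> list of per-column sums
    counts = defaultdict(int)  # group key -> row count
    for row in rows:
        key = row[key_index]
        values = [v for i, v in enumerate(row) if i != key_index]
        if key not in sums:
            sums[key] = [0.0] * len(values)
        for i, v in enumerate(values):
            sums[key][i] += v
        counts[key] += 1
    # avg = sum / count, finalized once per group
    return {k: [s / counts[k] for s in col_sums] for k, col_sums in sums.items()}

rows = [("a", 1.0, 2.0), ("a", 3.0, 4.0), ("b", 5.0, 6.0)]
print(group_averages(rows, 0))  # {'a': [2.0, 3.0], 'b': [5.0, 6.0]}
```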

Re: spark dataframe gc

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Increasing spark.shuffle.sort.bypassMergeThreshold might help. You could
also try switching the shuffle manager from sort to hash. You can see more
configuration options here
<https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior>.
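Concretely, the suggestion above would look something like this in
spark-defaults.conf (or via --conf on spark-submit); the threshold value is
illustrative and should be tuned for your job:

```
# Illustrative values only; tune for your cluster.
spark.shuffle.manager                    hash
spark.shuffle.sort.bypassMergeThreshold  400
```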

Thanks
Best Regards

On Fri, Jul 24, 2015 at 3:33 AM, Mohit Jaggi <mo...@gmail.com> wrote:
