Posted to user@spark.apache.org by Mohit Jaggi <mo...@gmail.com> on 2015/07/24 00:03:40 UTC
spark dataframe gc
Hi There,
I am testing Spark DataFrame and haven't been able to get my code to finish
due to what I suspect are GC issues. My guess is that GC pauses interfere with
heartbeating, so executors are detected as failed. The data is ~50 numeric
columns and ~100 million rows in a CSV file.
We are doing a groupBy using one of the columns and trying to calculate the
average of each of the other columns. The groupBy key has about 250k unique
values.
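For reference, a stripped-down sketch of the kind of job we are running (the
column name "key" and the path below are placeholders, not our real schema):

// Sketch only: "key" and the CSV path are made-up placeholders.
// Spark 1.4 DataFrame API; CSV loading assumes the spark-csv package.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs:///path/to/data.csv")

// group by one column (~250k distinct values), average all other numeric columns
val averages = df.groupBy("key").avg()
averages.count()  // forces the aggregation to run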
It seems that Spark is creating a lot of temp objects (see jmap output
below) while calculating the average, which I am surprised to see. Why
doesn't it reuse the same temp variable? Am I missing something? Do I need to
specify a config flag to enable code generation so it avoids this?
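To be concrete about that last question, this is the kind of flag I have in
mind (I have not verified it is the one that matters here):

// Assumption on my part: spark.sql.codegen is the runtime code generation
// flag for expression evaluation in Spark SQL 1.4; I have not confirmed
// that it avoids the temp-object churn shown below.
sqlContext.setConf("spark.sql.codegen", "true")
// or equivalently when submitting:
//   spark-submit --conf spark.sql.codegen=true ...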
Mohit.
[xxxxx app-20150723142604-0002]$ jmap -histo 12209
 num     #instances         #bytes  class name
----------------------------------------------
   1:     258615458     8275694656  scala.collection.immutable.$colon$colon
   2:     103435856     7447381632  org.apache.spark.sql.catalyst.expressions.Cast
   3:     103435856     4964921088  org.apache.spark.sql.catalyst.expressions.Coalesce
   4:       1158643     4257400112  [B
   5:      51717929     4137434320  org.apache.spark.sql.catalyst.expressions.SumFunction
   6:      51717928     3723690816  org.apache.spark.sql.catalyst.expressions.Add
   7:      51717929     2896204024  org.apache.spark.sql.catalyst.expressions.CountFunction
   8:      51717928     2896203968  org.apache.spark.sql.catalyst.expressions.MutableLiteral
   9:      51717928     2482460544  org.apache.spark.sql.catalyst.expressions.Literal
  10:      51803728     1243289472  java.lang.Double
  11:      51717755     1241226120  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5
  12:        975810      850906320  [Lorg.apache.spark.sql.catalyst.expressions.AggregateFunction;
  13:      51717754      827484064  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$org$apache$spark$sql$catalyst$expressions$Cast$$cast$1
  14:        982451       47157648  java.util.HashMap$Entry
  15:        981132       34981720  [Ljava.lang.Object;
  16:       1049984       25199616  org.apache.spark.sql.types.UTF8String
  17:        978296       23479104  org.apache.spark.sql.catalyst.expressions.GenericRow
  18:        117166       15944560  <methodKlass>
  19:        117166       14986224  <constMethodKlass>
  20:          1567       12891952  [Ljava.util.HashMap$Entry;
  21:          9103       10249728  <constantPoolKlass>
  22:          9103        9278592  <instanceKlassKlass>
  23:          5072        5691320  [I
  24:          7281        5335040  <constantPoolCacheKlass>
  25:         46420        4769600  [C
  26:        105984        3391488  io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
Re: spark dataframe gc
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Setting spark.shuffle.sort.bypassMergeThreshold might help. You could also try
switching the shuffle manager from sort to hash. You can see more
configuration options here
<https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior>.
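For example, something like this (the threshold value below is only
illustrative; tune it for your job):

// Illustrative settings only -- the defaults are spark.shuffle.manager=sort
// and spark.shuffle.sort.bypassMergeThreshold=200.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.manager", "hash")
  .set("spark.shuffle.sort.bypassMergeThreshold", "400")

Or pass them with --conf when submitting the job.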
Thanks
Best Regards
On Fri, Jul 24, 2015 at 3:33 AM, Mohit Jaggi <mo...@gmail.com> wrote:
> Hi There,
> I am testing Spark DataFrame and haven't been able to get my code to finish
> due to what I suspect are GC issues. My guess is that GC pauses interfere
> with heartbeating, so executors are detected as failed. The data is ~50
> numeric columns and ~100 million rows in a CSV file.
> We are doing a groupBy using one of the columns and trying to calculate
> the average of each of the other columns. The groupBy key has about 250k
> unique values.
> It seems that Spark is creating a lot of temp objects (see jmap output
> below) while calculating the average, which I am surprised to see. Why
> doesn't it reuse the same temp variable? Am I missing something? Do I need to
> specify a config flag to enable code generation so it avoids this?