Posted to issues@spark.apache.org by "lichenglin (JIRA)" <ji...@apache.org> on 2016/07/04 09:04:11 UTC
[jira] [Comment Edited] (SPARK-16361) It takes a long time for GC when building a cube with many fields
[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361039#comment-15361039 ]
lichenglin edited comment on SPARK-16361 at 7/4/16 9:03 AM:
------------------------------------------------------------
"A long time" means the gctime/Duration of each task.
you can find it in the monitoring server in some stage.
Every fields I add, this percent increase too ,until stuck the whole job.
My data's size is about 1 million, 1 node with 16 cores and 64 GB memory.
I have increase the memory of executor from 20 GB to 40 GB but not work well.
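For reference, here is roughly how that tuning can be applied, plus GC logging to confirm where the time goes (a sketch using standard Spark config keys; the app name is made up):
{code}
import org.apache.spark.SparkConf;

// Sketch: raise executor memory (tried 20g -> 40g above) and turn on GC logs.
SparkConf conf = new SparkConf()
    .setAppName("cube-gc-test")              // hypothetical app name
    .set("spark.executor.memory", "40g")     // raised from 20g; did not help
    // Print GC details in the executor logs to see where the time goes.
    .set("spark.executor.extraJavaOptions",
         "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
{code}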
> It takes a long time for GC when building a cube with many fields
> -----------------------------------------------------------------
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.2
> Reporter: lichenglin
>
> I'm using Spark to build a cube on a DataFrame with about 1 million rows.
> I found that when I add too many fields (about 8 or more),
> the workers spend a lot of time in GC.
> I tried to increase the memory of each worker, but it did not help,
> and I don't know why, sorry.
> Here are my code and the monitoring numbers.
> Cuber is a utility class for building cubes.
> {code:title=Bar.java|borderStyle=solid}
> // UDF mapping a month (1-12) to its quarter (1-4); "jidu" is Chinese for quarter.
> sqlContext.udf().register("jidu", (Integer f) -> {
>     return (f - 1) / 3 + 1;
> }, DataTypes.IntegerType);
>
> // Derive age, month, year, tenure-in-years (zwsc) and quarter columns.
> DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*",
>     "cast(CUST_AGE as double) as c_age",
>     "month(day) as month", "year(day) as year",
>     "cast((datediff(now(), INTIME) / 365 + 1) as int) as zwsc",
>     "jidu(month(day)) as jidu");
>
> // Bucket the continuous age into decade bins.
> Bucketizer b = new Bucketizer().setInputCol("c_age")
>     .setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10, 20, 30, 40,
>         50, 60, 70, 80, 90, 100, Double.POSITIVE_INFINITY })
>     .setOutputCol("age");
>
> // Cube over 9 fields with max/min/sum/count aggregates, then persist.
> DataFrame cube = new Cuber(b.transform(d))
>     .addFields("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc", "month",
>         "jidu", "year", "SUBTYPE")
>     .max("age").min("age").sum("zwsc").count()
>     .buildcube();
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
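> For context, a Cuber-style helper presumably reduces to Spark's built-in cube; a hypothetical sketch (the real Cuber class is not shown here):
> {code:title=CuberSketch.java|borderStyle=solid}
> import org.apache.spark.sql.DataFrame;
> import static org.apache.spark.sql.functions.*;
>
> // Hypothetical equivalent of the Cuber call above, using DataFrame.cube.
> // cube() groups by every subset of the listed columns, so 9 fields mean
> // 2^9 = 512 grouping sets per input row before partial aggregation.
> DataFrame cube = b.transform(d)
>     .cube("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc", "month",
>           "jidu", "year", "SUBTYPE")
>     .agg(max("age"), min("age"), sum("zwsc"), count(lit(1)).alias("cnt"));
> {code}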
> Summary Metrics for 12 Completed Tasks
>
> Metric                       | Min               | 25th percentile   | Median            | 75th percentile   | Max
> -----------------------------|-------------------|-------------------|-------------------|-------------------|------------------
> Duration                     | 2.6 min           | 2.7 min           | 2.7 min           | 2.7 min           | 2.7 min
> GC Time                      | 1.6 min           | 1.6 min           | 1.6 min           | 1.6 min           | 1.6 min
> Shuffle Read Size / Records  | 728.4 KB / 21886  | 736.6 KB / 22258  | 738.7 KB / 22387  | 746.6 KB / 22542  | 748.6 KB / 22783
> Shuffle Write Size / Records | 74.3 MB / 1926282 | 75.8 MB / 1965860 | 76.2 MB / 1976004 | 76.4 MB / 1981516 | 77.9 MB / 2021142
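> For scale, note that a CUBE over n fields produces 2^n grouping sets, so every added field doubles the number of aggregate groups; a quick check (illustrative arithmetic only):
> {code}
> public class CubeExpansion {
>     public static void main(String[] args) {
>         long inputRows = 1_000_000L;          // ~1M rows, as stated above
>         for (int fields = 8; fields <= 9; fields++) {
>             long sets = 1L << fields;         // CUBE over n fields => 2^n grouping sets
>             System.out.printf("%d fields -> %d grouping sets -> up to %,d pre-aggregation rows%n",
>                     fields, sets, inputRows * sets);
>         }
>     }
> }
> {code}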