Posted to issues@spark.apache.org by "lichenglin (JIRA)" <ji...@apache.org> on 2016/07/04 08:28:11 UTC
[jira] [Created] (SPARK-16361) It takes a long time for gc when
building cube with many fields
lichenglin created SPARK-16361:
----------------------------------
Summary: It takes a long time for gc when building cube with many fields
Key: SPARK-16361
URL: https://issues.apache.org/jira/browse/SPARK-16361
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.6.2
Reporter: lichenglin
I'm using Spark to build a cube on a DataFrame with about 1 million rows.
I found that when I add too many fields (about 8 or more),
the workers spend a lot of time in GC.
I tried increasing the memory of each worker, but it did not help much,
and I don't know why, sorry.
Here is my simplified code and the task monitoring output.
Cuber is a utility class for building cubes.
{code:title=Bar.java|borderStyle=solid}
// register a UDF that maps a month number (1-12) to its quarter (1-4)
sqlContext.udf().register("jidu", (Integer f) -> (f - 1) / 3 + 1, DataTypes.IntegerType);

DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*",
        "cast(CUST_AGE as double) as c_age",
        "month(day) as month",
        "year(day) as year",
        "cast((datediff(now(), INTIME) / 365 + 1) as int) as zwsc",
        "jidu(month(day)) as jidu");

// bucket customer age into ten-year ranges
Bucketizer b = new Bucketizer()
        .setInputCol("c_age")
        .setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10, 20, 30, 40,
                50, 60, 70, 80, 90, 100, Double.POSITIVE_INFINITY })
        .setOutputCol("age");

DataFrame cube = new Cuber(b.transform(d))
        .addFields("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc",
                "month", "jidu", "year", "SUBTYPE")
        .max("age").min("age").sum("zwsc").count()
        .buildcube();

cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
{code}
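This behavior is consistent with how CUBE works in general: grouping on N fields produces 2^N grouping sets, so each input row can be expanded into up to 2^N rows before the shuffle. A rough sketch of the blow-up (plain Java; the 22387 figure is the median shuffle-read record count from the metrics below, and the per-row factor is an upper bound since Spark may partially aggregate before shuffling):

```java
// Sketch: why each extra cube field roughly doubles shuffle volume
// and GC pressure. CUBE over N fields emits 2^N grouping sets.
public class CubeBlowup {
    public static void main(String[] args) {
        long medianReadRecords = 22387L; // median shuffle-read records per task

        for (int fields = 7; fields <= 9; fields++) {
            long groupingSets = 1L << fields; // 2^N grouping sets
            System.out.printf("%d fields -> %d grouping sets, up to %d rows per task%n",
                    fields, groupingSets, groupingSets * medianReadRecords);
        }
    }
}
```

With 9 fields that is 512 grouping sets per input row, which would explain why memory use and GC time grow so sharply past 8 fields.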
Summary Metrics for 12 Completed Tasks

Metric                       | Min               | 25th percentile   | Median            | 75th percentile   | Max
-----------------------------|-------------------|-------------------|-------------------|-------------------|------------------
Duration                     | 2.6 min           | 2.7 min           | 2.7 min           | 2.7 min           | 2.7 min
GC Time                      | 1.6 min           | 1.6 min           | 1.6 min           | 1.6 min           | 1.6 min
Shuffle Read Size / Records  | 728.4 KB / 21886  | 736.6 KB / 22258  | 738.7 KB / 22387  | 746.6 KB / 22542  | 748.6 KB / 22783
Shuffle Write Size / Records | 74.3 MB / 1926282 | 75.8 MB / 1965860 | 76.2 MB / 1976004 | 76.4 MB / 1981516 | 77.9 MB / 2021142
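Taking the medians above, GC accounts for roughly 60% of each task's run time, which is a quick way to quantify the problem:

```java
// Quick arithmetic on the median task metrics reported above.
public class GcFraction {
    public static void main(String[] args) {
        double durationMin = 2.7; // median task duration (minutes)
        double gcMin = 1.6;       // median GC time (minutes)
        System.out.printf("GC is %.0f%% of task time%n", 100.0 * gcMin / durationMin);
    }
}
```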
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)