Posted to issues@spark.apache.org by "lichenglin (JIRA)" <ji...@apache.org> on 2016/07/04 08:28:11 UTC

[jira] [Created] (SPARK-16361) It takes a long time for gc when building cube with many fields

lichenglin created SPARK-16361:
----------------------------------

             Summary: It takes a long time for gc when building cube with many fields
                 Key: SPARK-16361
                 URL: https://issues.apache.org/jira/browse/SPARK-16361
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.6.2
            Reporter: lichenglin


I'm using Spark to build a cube on a DataFrame with about 1m rows.
I found that when I add too many fields (about 8 or more),
the workers spend a lot of time in GC.
I tried increasing the memory of each worker, but it did not help much.
I don't know why, sorry.
Here is my simplified code and the monitoring output.
Cuber is a utility class for building the cube.

{code:title=Bar.java|borderStyle=solid}
		// Map a month (1-12) to its quarter (1-4).
		sqlContext.udf().register("jidu", (Integer f) -> {
			return (f - 1) / 3 + 1;
		}, DataTypes.IntegerType);
		// Derive the extra dimension columns used by the cube.
		DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as double) as c_age",
				"month(day) as month", "year(day) as year", "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
				"jidu(month(day)) as jidu");
		// Bucket c_age into 10-year bands, written to the "age" column.
		Bucketizer b = new Bucketizer().setInputCol("c_age").setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10,
				20, 30, 40, 50, 60, 70, 80, 90, 100, Double.POSITIVE_INFINITY }).setOutputCol("age");
		// Cube over nine fields with four aggregates.
		DataFrame cube = new Cuber(b.transform(d))
				.addFields("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc", "month", "jidu", "year", "SUBTYPE").max("age")
				.min("age").sum("zwsc").count().buildcube();
		cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
{code}
Summary Metrics for 12 Completed Tasks

Metric                        Min                  25th percentile      Median               75th percentile      Max
Duration                      2.6 min              2.7 min              2.7 min              2.7 min              2.7 min
GC Time                       1.6 min              1.6 min              1.6 min              1.6 min              1.6 min
Shuffle Read Size / Records   728.4 KB / 21886     736.6 KB / 22258     738.7 KB / 22387     746.6 KB / 22542     748.6 KB / 22783
Shuffle Write Size / Records  74.3 MB / 1926282    75.8 MB / 1965860    76.2 MB / 1976004    76.4 MB / 1981516    77.9 MB / 2021142
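
For a sense of scale, a cube over n fields aggregates over every subset of those fields, which gives 2^n grouping combinations. The sketch below only does that arithmetic for the field counts mentioned above. The class and method names (CubeSize, groupingSets, expandedRows) are hypothetical and not part of Spark or of Cuber. It also assumes, without confirmation from this report, that Cuber evaluates a full cube and that each input row is expanded once per grouping combination before aggregation, so the row figure is at best a rough upper bound.

```java
// Hypothetical helper, for illustration only; not part of Spark or Cuber.
public class CubeSize {

    // A full cube over `fields` columns has one grouping combination
    // per subset of those columns: 2^fields.
    static long groupingSets(int fields) {
        if (fields < 0 || fields > 62) {
            throw new IllegalArgumentException("fields must be in [0, 62]");
        }
        return 1L << fields;
    }

    // Upper bound on intermediate rows, ASSUMING every input row is
    // emitted once per grouping combination before aggregation.
    static long expandedRows(long inputRows, int fields) {
        return Math.multiplyExact(inputRows, groupingSets(fields));
    }

    public static void main(String[] args) {
        long inputRows = 1_000_000L; // "about 1m rows" from the report
        for (int fields = 7; fields <= 9; fields++) {
            System.out.println(fields + " fields: " + groupingSets(fields)
                    + " grouping combinations, up to "
                    + expandedRows(inputRows, fields) + " intermediate rows");
        }
    }
}
```

Under those assumptions, the nine fields in the code above give 512 grouping combinations, so each added field doubles the work. That may be related to the GC time reported here, though the report itself does not establish the cause.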



