Posted to issues@spark.apache.org by "Nitin Goyal (JIRA)" <ji...@apache.org> on 2015/05/30 20:20:17 UTC
[jira] [Created] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
Nitin Goyal created SPARK-7970:
----------------------------------
Summary: Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
Key: SPARK-7970
URL: https://issues.apache.org/jira/browse/SPARK-7970
Project: Spark
Issue Type: Improvement
Components: Spark Core, SQL
Affects Versions: 1.3.0, 1.2.0
Reporter: Nitin Goyal
The closure cleaner slows down the execution of Spark SQL queries fired on a union of RDDs. The time spent on the driver side increases linearly with the number of RDDs unioned. Refer to the following thread for more context:
http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
As can be seen in the attached JProfiler screenshots, a lot of time is consumed in the "getClassReader" method of ClosureCleaner and the rest in "ensureSerializable" (at least in my case).
This can be fixed in two ways (as per my current understanding):
1. Fix at Spark SQL level - As pointed out by yhuai, we can create the MapPartitionsRDD directly instead of calling rdd.mapPartitions, which calls ClosureCleaner's clean method.
2. Fix at Spark core level -
(i) Make "checkSerializable" property-driven in SparkContext's clean method
(ii) Somehow cache the class reader for the last 'n' classes
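
To make option 2 a bit more concrete, here is a rough, standalone sketch in plain Java. It is not Spark code and not a proposed patch. LruCache, CleanerSettings and the property name "example.closure.checkSerializable" are all invented for illustration, and byte[] merely stands in for whatever "getClassReader" produces for a class. The idea is only that a small bounded cache keyed by class name avoids repeating the same load for every RDD in the union, and that the serializability check could be switched off through a property.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper (not part of Spark): a bounded LRU cache for 2(ii).
final class LruCache<K, V> {
    private final Map<K, V> entries;
    private int loads = 0; // how many times the loader actually ran

    LruCache(final int capacity) {
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry drops it once the capacity is exceeded.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns the cached value, invoking the loader only on a miss.
    synchronized V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            entries.put(key, value);
        }
        return value;
    }

    synchronized int loads() { return loads; }

    synchronized int size() { return entries.size(); }
}

// Hypothetical switch for 2(i): the property name is an assumption.
final class CleanerSettings {
    // Defaults to true so the check stays on unless explicitly disabled.
    static boolean checkSerializable() {
        return Boolean.parseBoolean(
            System.getProperty("example.closure.checkSerializable", "true"));
    }
}
```

With a capacity of 'n', repeated lookups for the same closure class hit the cache, so the expensive load runs once per class instead of once per unioned RDD, as long as the working set of closure classes fits in the cache. Whether the cached value can safely be shared across calls would need to be checked against the actual ClosureCleaner code.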