Posted to issues@spark.apache.org by "Nitin Goyal (JIRA)" <ji...@apache.org> on 2015/05/30 20:20:17 UTC

[jira] [Created] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

Nitin Goyal created SPARK-7970:
----------------------------------

             Summary: Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
                 Key: SPARK-7970
                 URL: https://issues.apache.org/jira/browse/SPARK-7970
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 1.3.0, 1.2.0
            Reporter: Nitin Goyal


The closure cleaner slows down the execution of Spark SQL queries fired on a union of RDDs. The time spent on the driver side increases linearly with the number of RDDs unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is consumed in the "getClassReader" method of ClosureCleaner and the rest in "ensureSerializable" (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at Spark SQL level - As pointed out by yhuai, we can create MapPartitionsRDD directly instead of calling rdd.mapPartitions, which calls ClosureCleaner's clean method.

2. Fix at Spark core level -
  (i) Make "checkSerializable" property-driven in SparkContext's clean method
  (ii) Somehow cache the class reader for the last 'n' classes
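
A rough sketch of the two Spark core ideas above, in plain Java with only the standard library. It is not Spark code: the class name ClosureCleanerSketch, the property name "sketch.closure.checkSerializable" and all method names are hypothetical, and the cache stores raw class-file bytes as a stand-in for the ASM ClassReader objects the real ClosureCleaner builds. Under those assumptions, (i) becomes a system property that gates the serializability check, and (ii) becomes a small LRU cache keyed by class name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names exist in Spark. */
public class ClosureCleanerSketch {
    // Hypothetical property name, not a real Spark setting.
    static final String CHECK_PROPERTY = "sketch.closure.checkSerializable";

    private final Map<String, byte[]> cache;
    private int loads = 0; // number of times a class file was actually read

    public ClosureCleanerSketch(final int maxEntries) {
        // accessOrder = true turns LinkedHashMap into an LRU structure;
        // removeEldestEntry drops the least recently used class beyond 'n'.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** (ii) Return the class-file bytes, reading them only on a cache miss. */
    public synchronized byte[] classBytes(Class<?> cls) {
        byte[] bytes = cache.get(cls.getName());
        if (bytes == null) {
            bytes = readClassFile(cls);
            loads++;
            cache.put(cls.getName(), bytes);
        }
        return bytes;
    }

    public synchronized int loads() {
        return loads;
    }

    private static byte[] readClassFile(Class<?> cls) {
        // Resource name relative to the class's own package, e.g. "String.class".
        String resource = cls.getName().replaceFirst("^.*\\.", "") + ".class";
        try (InputStream in = cls.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IllegalArgumentException("No class file for " + cls.getName());
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** (i) The check runs unless the property is explicitly set to "false". */
    public static boolean checkEnabled() {
        return Boolean.parseBoolean(System.getProperty(CHECK_PROPERTY, "true"));
    }

    /** Trial serialization; fails for objects that are not Serializable. */
    public static void ensureSerializable(Object closure) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(closure);
        } catch (IOException e) {
            throw new IllegalArgumentException("Task not serializable", e);
        }
    }

    /** Stand-in for a clean method whose serializability check is property-driven. */
    public static <T> T clean(T closure) {
        if (checkEnabled()) {
            ensureSerializable(closure);
        }
        return closure;
    }
}
```

With a cache of size 'n', repeated lookups for the same closure class across many unioned RDDs would hit the cache instead of re-reading the class file each time. Turning the property off skips the trial serialization entirely, at the cost of only finding non-serializable closures later, when the task is actually shipped.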



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
