Posted to user@spark.apache.org by Stefan Ackermann <st...@zuehlke.com> on 2017/02/11 17:46:49 UTC

Disable Spark SQL Optimizations for unit tests

Hi,

Can the Spark SQL Optimizations be disabled somehow?

In our project we started writing Scala / Spark / DataFrame code four
weeks ago. We have implemented only around 10% of the planned project
scope so far, and we are already waiting 10 minutes (Spark 2.1.0,
everything cached) to 30 minutes (Spark 1.6, nothing cached) for a
single unit test run to finish. We have, for example, one Scala file of
maybe 80 lines (several joins, several subtrees reused in different
places) that takes up to 6 minutes to be optimized (the Catalyst output
is also > 100 MB). The input for our unit tests is usually 2 - 3 rows.
That is the motivation to disable the optimizer in unit tests.
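
For reference, the blow-up is visible when printing the plans, e.g. (a
minimal sketch, where df stands for one of our DataFrames):

  // Prints the parsed, analyzed, optimized and physical plans; for our
  // jobs the optimized plan alone ran to > 100 MB of text.
  df.explain(extended = true)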

I have found this unanswered SO post
<http://stackoverflow.com/questions/33984152/how-to-speed-up-spark-sql-unit-tests>,
but not much more on that topic. I have also found the
SimpleTestOptimizer
<https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L151>,
which sounds perfect, but I have no idea how to instantiate a
SparkSession so that it uses it.
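
One idea that might at least bound the optimizer's work is to cap its
fixed-point iterations (an untested sketch; spark.sql.optimizer.maxIterations
is an internal setting, not a supported switch to turn the optimizer off):

  import org.apache.spark.sql.SparkSession

  // Untested: limit how many fixed-point iterations Catalyst runs per
  // rule batch. This reduces optimization time but does not disable it.
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("unit-tests")
    .config("spark.sql.optimizer.maxIterations", "2")
    .getOrCreate()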

Does nobody else have this problem? Is there something fundamentally wrong
with our approach?

Regards,
Stefan Ackermann





Re: Disable Spark SQL Optimizations for unit tests

Posted by Stefan Ackermann <st...@zuehlke.com>.
I found some ways to get faster unit tests. In the meantime the runs
had gone up to about an hour.

Apparently defining columns in a for loop makes Catalyst very slow, as
it blows up the logical plan with one projection per added column:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.types.IntegerType

  // Slow: every withColumn call wraps the plan in another Project node,
  // so n casted columns produce n nested projections for Catalyst.
  final def castInts(dfIn: DataFrame, castToInts: String*): DataFrame = {
    var df = dfIn
    for (toBeCasted <- castToInts) {
      df = df.withColumn(toBeCasted, df(toBeCasted).cast(IntegerType))
    }
    df
  }

This is much faster, as it issues a single select and therefore a
single projection:

  // Fast: build all column expressions first, then issue one select,
  // which adds only a single Project node to the logical plan.
  final def castInts(dfIn: DataFrame, castToInts: String*): DataFrame = {
    val columns = dfIn.columns.map { c =>
      if (castToInts.contains(c)) {
        dfIn(c).cast(IntegerType)
      } else {
        dfIn(c)
      }
    }
    dfIn.select(columns: _*)
  }

As I systematically applied this pattern to other similar functions,
the unit tests went down from 60 to 18 minutes.

Another way to break up the SQL optimizations was to simply save an
intermediate dataframe to HDFS and read it back from there. This is
quite counterintuitive, but it brought the unit tests down further,
from 18 minutes to 5.
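
In code, the barrier looks roughly like this (a sketch; the path and
the Parquet format are illustrative choices):

  import org.apache.spark.sql.{DataFrame, SparkSession}

  // Write the intermediate result out and read it back. The re-read
  // DataFrame starts from a fresh file relation, so Catalyst no longer
  // fuses the upstream and downstream plans into one huge tree.
  def optimizationBarrier(df: DataFrame, path: String)
                         (implicit spark: SparkSession): DataFrame = {
    df.write.mode("overwrite").parquet(path)
    spark.read.parquet(path)
  }

Dataset.checkpoint() (available since Spark 2.1) should achieve
something similar, as it also truncates the logical plan, at the cost
of materializing the data to the checkpoint directory.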

Is there any other way to add a barrier for Catalyst optimizations? As
in: given A -> B -> C, optimize A -> B and B -> C separately, but not
the complete A -> C?


