Posted to user@spark.apache.org by Jim Carroll <ji...@gmail.com> on 2014/11/12 20:34:07 UTC

Wildly varying "aggregate" performance depending on code location

Hello all,

I have a really strange thing going on.

I have a test data set with 500K lines in a gzipped csv file.

I have an array of "column processors," one for each column in the dataset.
Each Processor tracks aggregate state for its column and has a method process(v: String).
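For concreteness, the shape is roughly this (a minimal sketch; the trait layout
and the MaxLengthProcessor example are illustrative, not my actual classes):

  trait Processor extends Serializable {
    def process(v: String): Unit
    def merge(other: Processor): Unit  // used on the combine side of aggregate
  }

  // Illustrative column processor: tracks the longest value seen in one column.
  class MaxLengthProcessor extends Processor {
    var maxLen = 0
    def process(v: String): Unit = if (v.length > maxLen) maxLen = v.length
    def merge(other: Processor): Unit = other match {
      case m: MaxLengthProcessor => maxLen = math.max(maxLen, m.maxLen)
      case _                     => // nothing sensible to merge in this sketch
    }
  }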

I'm calling:

  val processors: Array[Processor] = ....

  sc.textFile(gzippedFileName).aggregate(processors)(
    { (curState, row) =>
        row.split(",", -1).zipWithIndex.foreach {
          case (v, i) => curState(i).process(v)
        }
        curState
    }, ....)
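
The part I elided is the combOp, which aggregate uses to merge the per-partition
results. A sketch of what that typically looks like here, using the merge(other)
method assumed in the sketch above (again an illustration, not my exact code):

  val combOp = (a: Array[Processor], b: Array[Processor]) => {
    a.zip(b).foreach { case (pa, pb) => pa.merge(pb) }
    a
  }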

If the class definition for the Processors is in the same file as the driver,
it runs in ~23 seconds. If I move the classes to a separate file in the same
package, without ANY OTHER CHANGES, it goes to ~35 seconds.

This doesn't make any sense to me. I can't even understand how the compiled
class files could be any different in either case.

Does anyone have an explanation for why this might be?






Re: Wildly varying "aggregate" performance depending on code location

Posted by Jim Carroll <ji...@gmail.com>.
Well, it looks like this is a Scala problem after all. I loaded the file using
pure Scala and ran the exact same Processors without Spark, and on the 500K rows
I got 20 seconds with the code in the same file as the 'main' vs. 30 seconds
with the exact same code in a different file.
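
In case it helps anyone reproduce the comparison, the standalone harness was
along these lines (the path, column count, and MaxLengthProcessor are
placeholders for this sketch, not my real setup):

  import java.io.FileInputStream
  import java.util.zip.GZIPInputStream
  import scala.io.Source

  object StandaloneTest {
    def main(args: Array[String]): Unit = {
      // One processor per CSV column; the count and concrete type are placeholders.
      val numColumns = 10
      val processors: Array[Processor] = Array.fill(numColumns)(new MaxLengthProcessor)

      val in = new GZIPInputStream(new FileInputStream("data.csv.gz"))  // placeholder path
      val start = System.nanoTime()
      Source.fromInputStream(in).getLines().foreach { row =>
        row.split(",", -1).zipWithIndex.foreach {
          case (v, i) => processors(i).process(v)
        }
      }
      println(f"elapsed: ${(System.nanoTime() - start) / 1e9}%.1f s")
    }
  }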


