Posted to dev@spark.apache.org by "Ulanov, Alexander" <al...@hp.com> on 2015/03/13 01:34:40 UTC

Profiling Spark: MemoryStore

Hi,

I am working on artificial neural networks for Spark. The model is trained with Gradient Descent, so at each step the data is read, a sum of gradients is calculated for each data partition (on each worker), aggregated (on the driver), and broadcast back. I noticed that the gradient computation time is several times less than the total time needed for each step. To narrow down my observation, I ran the gradient on a single machine with a single partition of data of size 100MB that I persist (data.persist). This should at least minimize the aggregation overhead, but the gradient computation still takes much less time than the whole step. Just in case, the data is loaded by MLUtils.loadLibSVMFile into an RDD[LabeledPoint]; this is my code:

    val conf = new SparkConf().setAppName("myApp").setMaster("local[2]")
    val train = MLUtils.loadLibSVMFile(new SparkContext(conf), "/data/mnist/mnist.scale").repartition(1).persist()
    val model = ANN2Classifier.train(train, 1000, Array[Int](32), 10, 1e-4) //training data, batch size, hidden layer size, iterations, LBFGS tolerance
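To make the per-step pattern described above concrete, here is a minimal plain-Scala sketch (no Spark, and a simple least-squares gradient as a stand-in for the ANN gradient): each partition computes a partial gradient sum, and the driver adds the partials together, playing the role that aggregate/treeAggregate plays in the real job. All names here are illustrative, not from the actual code.

```scala
object GradientStepSketch {
  // One data point: features and label.
  final case class Point(x: Array[Double], y: Double)

  // Gradient of 0.5 * (w.x - y)^2 for one point: (w.x - y) * x
  def pointGradient(w: Array[Double], p: Point): Array[Double] = {
    val pred = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
    p.x.map(_ * (pred - p.y))
  }

  // Sum of gradients within one partition (what each worker would compute).
  def partitionGradient(w: Array[Double], part: Seq[Point]): Array[Double] =
    part.foldLeft(Array.fill(w.length)(0.0)) { (acc, p) =>
      acc.zip(pointGradient(w, p)).map { case (a, g) => a + g }
    }

  // Driver-side aggregation of the per-partition sums.
  def fullGradient(w: Array[Double], parts: Seq[Seq[Point]]): Array[Double] =
    parts.map(partitionGradient(w, _)).reduce { (a, b) =>
      a.zip(b).map { case (x, y) => x + y }
    }
}
```

With a single partition, the aggregation step degenerates to a no-op, which is why the remaining per-step overhead in the profile was surprising.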

The profiler shows that there are two threads: one is running the Gradient computation, and I don't know what the other is doing. The Gradient accounts for only 10% of the first thread; almost all the remaining time is spent in MemoryStore. Below is the screenshot (first thread):
https://drive.google.com/file/d/0BzYMzvDiCep5bGp2S2F6eE9TRlk/view?usp=sharing
Second thread:
https://drive.google.com/file/d/0BzYMzvDiCep5OHA0WUtQbXd3WmM/view?usp=sharing

Could Spark developers please elaborate on what's going on in MemoryStore? It seems to do some string operations (parsing the libsvm file? Why on every step?) and a lot of InputStream reading. The overall time seems to depend on the size of the data batch (or the size of the vector) I am processing, but it does not look linear to me.

Also, I would like to know how to speed up these operations.

Best regards, Alexander


Re: Profiling Spark: MemoryStore

Posted by Kay Ousterhout <ke...@eecs.berkeley.edu>.
Hi Alexander,

The stack trace is a little misleading here: all of the time is spent in
MemoryStore, but that's because MemoryStore is unrolling an iterator (note
the iterator.next() call) so that the data can be stored in memory.  Essentially
all of the computation for the tasks happens as part of that
iterator.next() call, which is why you're seeing a combination of
deserializing input data with Snappy (the InputStream reading) and some
MLlib processing.
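Kay's point can be illustrated with a few lines of plain Scala (no Spark): when a lazy iterator is drained ("unrolled") into a buffer, all the upstream computation runs inside next(), so a profiler attributes the time to the draining code rather than to the function doing the actual work. The names below are purely illustrative.

```scala
object UnrollSketch {
  var workCalls = 0
  def expensiveWork(x: Int): Int = { workCalls += 1; x * x }

  def main(args: Array[String]): Unit = {
    // A lazy pipeline: nothing has run yet.
    val lazyIter = Iterator(1, 2, 3).map(expensiveWork)
    assert(workCalls == 0)

    // "Unrolling", as MemoryStore does: drain the iterator into memory.
    // All three calls to expensiveWork happen here, inside next().
    val unrolled = lazyIter.toArray
    assert(workCalls == 3)
    assert(unrolled.sameElements(Array(1, 4, 9)))
  }
}
```

In a profile, the time for expensiveWork would show up under whatever frame called toArray, which is exactly what happens with MemoryStore's unrolling loop.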

-Kay

On Thu, Mar 12, 2015 at 5:34 PM, Ulanov, Alexander <al...@hp.com>
wrote: