Posted to dev@spark.apache.org by "Ulanov, Alexander" <al...@hp.com> on 2015/04/01 19:09:05 UTC

RE: Stochastic gradient descent performance

Sorry for bothering you again, but I think this is an important issue for the applicability of SGD in Spark MLlib. Could the Spark developers please comment on it?

-----Original Message-----
From: Ulanov, Alexander 
Sent: Monday, March 30, 2015 5:00 PM
To: dev@spark.apache.org
Subject: Stochastic gradient descent performance

Hi,

It seems to me that there is an overhead in the "runMiniBatchSGD" function of MLlib's "GradientDescent". In particular, "sample" and "treeAggregate" can take an order of magnitude more time than the actual gradient computation. For the MNIST dataset of 60K instances with minibatch fraction = 0.001 (i.e. 60 samples), it takes 0.15 s to sample and 0.3 s to aggregate in local mode with 1 data partition on a Core i5 processor, while the actual gradient computation takes 0.002 s. I searched through the Spark JIRA and found that an update for more efficient sampling (SPARK-3250) was recently merged into the Spark codebase. Is there a way to reduce the sampling time and the local treeAggregate time by an order of magnitude?
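
For reference, a minimal sketch of how such timings could be reproduced (the synthetic stand-in for MNIST, the timing helper, and the per-example sums standing in for a real gradient are all illustrative assumptions, not MLlib's actual code):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Illustrative timing helper.
    def time[T](label: String)(body: => T): T = {
      val t0 = System.nanoTime()
      val result = body
      println(s"$label: ${(System.nanoTime() - t0) / 1e6} ms")
      result
    }

    val sc = new SparkContext(
      new SparkConf().setMaster("local[1]").setAppName("sgd-overhead"))

    // 60K synthetic 784-dimensional instances standing in for MNIST,
    // cached in a single partition.
    val data = sc.parallelize(0 until 60000, numSlices = 1)
      .map(i => LabeledPoint(i % 10, Vectors.dense(Array.fill(784)(math.random))))
      .cache()
    data.count() // materialize the cache

    // The two RDD operations runMiniBatchSGD performs per iteration.
    val batch = time("sample") {
      data.sample(withReplacement = false, fraction = 0.001).collect()
    }
    time("treeAggregate") {
      data.sample(false, 0.001).treeAggregate(0.0)((acc, p) => acc + p.label, _ + _)
    }

    // The purely local work over the same ~60 examples, for comparison.
    time("local gradient") {
      batch.map(p => p.features.toArray.sum).sum
    }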

Best regards, Alexander



Re: Stochastic gradient descent performance

Posted by Reynold Xin <rx...@databricks.com>.
Note that we can do this in DataFrames and use Catalyst to push Sample down
beneath Projection :)
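
A hedged sketch of what that looks like from the user's side (assumes an existing DataFrame named points with "label" and "features" columns; whether the optimizer actually reorders the two operators depends on the Catalyst rules in the running version):

    // Logically Sample(Project(points)); Catalyst is free to rewrite it
    // as Project(Sample(points)) so only the sampled rows get projected.
    val miniBatch = points
      .select("label", "features")
      .sample(withReplacement = false, fraction = 0.001)

    miniBatch.explain(true) // inspect the optimized plan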


Re: Stochastic gradient descent performance

Posted by Xiangrui Meng <me...@gmail.com>.
The gap sampling is triggered when the sampling probability is small
and the underlying storage supports constant-time lookups, in
particular, ArrayBuffer. This is a very strict requirement. If an RDD is
cached in memory, we use an ArrayBuffer to store its elements, and
rdd.sample will trigger gap sampling. However, if we call rdd2 =
rdd.map(x => x), we can no longer tell whether the storage is backed
by an ArrayBuffer, and hence gap sampling is not enabled. Ideally we
would use Scala's drop(k) and let Scala decide whether that is an O(1)
or an O(k) operation. Unfortunately, due to a Scala bug, it could
become an O(k^2) operation, so we didn't use this approach.
Please check the comments in the gap sampling PR.
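
A small sketch of the two cases described above (assumes a live SparkContext; both produce the same kind of sample, only the sampling path differs):

    val rdd = sc.parallelize(1 to 10000000).cache()
    rdd.count() // cache the partitions so they are ArrayBuffer-backed

    // Sampling the cached RDD directly: the sampler can see the
    // constant-time storage, so gap sampling kicks in.
    val fast = rdd.sample(withReplacement = false, fraction = 0.001)

    // An identity map hides the storage behind a MapPartitionsRDD,
    // so the sampler falls back to scanning every element.
    val rdd2 = rdd.map(x => x)
    val slow = rdd2.sample(withReplacement = false, fraction = 0.001)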

For SGD, I think we should either assume the input data is randomized
or randomize the input data ourselves (and eat that one-time cost), then do
mini-batches sequentially. The key is to balance the batch size against the
communication cost of the model updates.
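
A minimal sketch of that approach (assumes an RDD[LabeledPoint] named data; the random-key shuffle, the batch size, and the per-batch computation are illustrative placeholders):

    // One-time cost: randomize the order of the data, then cache it.
    val shuffled = data
      .map(p => (scala.util.Random.nextLong(), p))
      .sortByKey()
      .values
      .cache()

    val batchSize = 60 // placeholder
    // Each pass walks the partitions in order instead of re-sampling,
    // so mini-batch access is sequential and deterministic.
    val perBatch = shuffled.mapPartitions { iter =>
      iter.grouped(batchSize).map { batch =>
        batch.map(_.label).sum / batch.size // placeholder gradient work
      }
    }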

Best,
Xiangrui



RE: Stochastic gradient descent performance

Posted by "Ulanov, Alexander" <al...@hp.com>.
Batch size impacts convergence, so a bigger batch means more iterations. There are approaches to deal with this (such as http://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf), but they still need to be implemented and tested.

Nonetheless, could you share your thoughts on reducing this overhead in Spark (or on a possible workaround)? Sorry for repeating myself, but I think this is crucial for MLlib, because Spark is intended for larger amounts of data. Machine learning on big data usually requires SGD (rather than batch GD), SGD requires many updates, and “Spark overhead” times “many updates” adds up to an impractical training time.



Re: Stochastic gradient descent performance

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
Yeah, a simple way to estimate the time for an iterative algorithm is
number of iterations required * time per iteration. The time per iteration
depends on the batch size, the computation required, and the fixed overheads
I mentioned before. The number of iterations of course depends on the
convergence rate of the problem being solved.
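
As a back-of-the-envelope sketch of that estimate (all of the numbers below are illustrative placeholders):

    // total time ~ iterations * (fixed per-stage overhead + per-batch compute)
    val fixedOverheadMs   = 500.0  // assumed per-stage overhead
    val computePerBatchMs = 2.0    // e.g. a measured per-mini-batch gradient time
    val iterations        = 10000  // depends on the convergence rate
    val totalHours = iterations * (fixedOverheadMs + computePerBatchMs) / 3.6e6
    // ~1.4 hours, dominated almost entirely by the fixed overhead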

Thanks
Shivaram


RE: Stochastic gradient descent performance

Posted by "Ulanov, Alexander" <al...@hp.com>.
Hi Shivaram,

It sounds really interesting! With this figure we can estimate whether it is worth running an iterative algorithm on Spark at all. For example, for SGD on ImageNet (450K samples), we would spend 450K*500ms=62.5 hours just to traverse all the data one example at a time, not counting the data loading, computation, and update times. One may need to traverse all the data a number of times to converge. Let’s say this number equals the mini-batch size: then the savings from batching cancel out, and we remain with 62.5 hours of overhead. Is that reasonable?

Best regards, Alexander



Re: Stochastic gradient descent performance

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
I haven't looked closely at the sampling issues, but regarding the
aggregation latency, there are fixed overheads (in local and distributed
mode) with the way aggregation is done in Spark. Launching a stage of
tasks, fetching outputs from the previous stage, etc. all have overhead, so
I would say it's not efficient or recommended to run stages where the
computation is less than 500 ms or so. You could increase your batch size
based on this, and hopefully that will help.

Regarding reducing these overheads by an order of magnitude: that is a
challenging problem given Spark's architecture -- I have some ideas
for this, but they are very much at a research stage.

Thanks
Shivaram


Re: Stochastic gradient descent performance

Posted by Joseph Bradley <jo...@databricks.com>.
When you say "It seems that instead of sample it is better to shuffle data
and then access it sequentially by mini-batches," are you sure that holds
true for a big dataset in a cluster?  As far as implementing it, I haven't
looked carefully at GapSamplingIterator (in RandomSampler.scala) myself,
but that looks like it could be modified to be deterministic.
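
As an illustration of the kind of deterministic variant being suggested (this is not Spark's actual GapSamplingIterator, just a plain strided iterator over a partition):

    // Deterministic counterpart of gap sampling: keep every k-th element,
    // with k ~ 1 / samplingFraction, instead of drawing random gap lengths.
    def stridedSample[T](iter: Iterator[T], fraction: Double): Iterator[T] = {
      val stride = math.max(1, math.round(1.0 / fraction).toInt)
      iter.zipWithIndex.collect { case (elem, i) if i % stride == 0 => elem }
    }

    // e.g. rdd.mapPartitions(stridedSample(_, 0.001)) yields the same
    // "mini-batch" on every run, assuming the data was shuffled once up front.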

Hopefully someone else can comment on aggregation in local mode.  I'm not
sure how much effort has gone into optimizing for local mode.

Joseph


RE: Stochastic gradient descent performance

Posted by "Ulanov, Alexander" <al...@hp.com>.
Hi Joseph,

Thank you for the suggestion!
It seems that, instead of sampling, it would be better to shuffle the data once and then access it sequentially in mini-batches. Could you suggest how to implement this?

With regard to aggregate (reduce), I am wondering why it is so slow in local mode. Could you elaborate on this? I understand that in cluster mode network speed kicks in and can take the blame, but that does not apply locally.

Best regards, Alexander



Re: Stochastic gradient descent performance

Posted by Joseph Bradley <jo...@databricks.com>.
It looks like SPARK-3250 was applied to the sample() which GradientDescent
uses, and that should kick in for your minibatchFraction <= 0.4.  Based on
your numbers, aggregation seems like the main issue, though I hesitate to
optimize aggregation based on local tests for data sizes that small.

The first thing I'd check for is unnecessary object creation, and to
profile in a cluster or larger data setting.
