Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2018/06/04 06:24:00 UTC
[jira] [Comment Edited] (MADLIB-1210) Add momentum methods to MLP

    [ https://issues.apache.org/jira/browse/MADLIB-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498678#comment-16498678 ] 

Frank McQuillan edited comment on MADLIB-1210 at 6/4/18 6:23 AM:
-----------------------------------------------------------------

I uploaded the momentum comparison for the MNIST data set (the earlier comparison Excel file is for the 2D Rosenbrock function).

Some observations on curves for MNIST dataset:

- momentum not useful
- but if you do use momentum, mini-batch is better than SGD

I think this is sufficient for the testing we need to run for this story.


was (Author: fmcquillan):
I uploaded the momentum comparison for the MNIST data set (the earlier comparison Excel file is for the 2D Rosenbrock function).

Some observations on curves for MNIST dataset:

- momentum not useful, regular SGD works better
- if using momentum, mini-batch is better than SGD

I think this is sufficient for the testing we need to run for this story.

> Add momentum methods to MLP
> ---------------------------
>
>                 Key: MADLIB-1210
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1210
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Neural Networks
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.15
>
>         Attachments: Momentum methods comparison.xlsx, momentum with MNIST dataset.pdf
>
>
> Story
> As a data scientist,
> I want to use momentum methods in MLP,
> so that I get significantly better convergence behavior.
> Details
> Adding momentum will get the MADlib MLP algorithm closer to state of the art.
> 1) Implement momentum term, default value ~0.9
> Ref [1]:
> "Momentum update is another approach that almost always enjoys better converge rates on deep networks." 
> 2) Implement Nesterov momentum, default TRUE
> Ref [1]:
> "Nesterov Momentum is a slightly different version of the momentum update that has recently been gaining popularity. It enjoys stronger theoretical converge guarantees for convex functions and in practice it also consistently works slightly better than standard momentum."
> Ref [2]
> "Nesterov’s accelerated gradient (abbrv. NAG; Nesterov, 1983) is a first-order optimization method which is proven to have a better convergence rate guarantee than gradient descent for general convex functions with Lipshitz-continuous derivatives (O(1/T^2) versus O(1/T))"
> Interface
> There are 2 new optimization params for momentum, which apply for both 
> classification and regression:
> {code}
> 'learning_rate_init = <value>,
> learning_rate_policy = <value>,
> gamma = <value>,
> power = <value>,
> iterations_per_step = <value>,
> n_iterations = <value>,
> n_tries = <value>,
> lambda = <value>,
> tolerance = <value>,
> batch_size = <value>,
> n_epochs = <value>,
> momentum = <value>,
> nesterov = <value>'
> momentum
> FLOAT8, default: 0.9. Momentum can help accelerate learning and 
> avoid local minima when using gradient descent. Value must be in the 
> range 0 to 1, where 0 means no momentum.
> nesterov
> BOOLEAN, default: TRUE. Nesterov momentum can provide better results than using
> classical momentum alone, due to its look ahead characteristics.  
> In classical momentum you first correct velocity and step with that 
> velocity, whereas in Nesterov momentum you first step in the velocity 
> direction then make a correction to the velocity vector based on 
> new location.
> Nesterov momentum is only used when the 'momentum' parameter is > 0.
> {code}
> Open questions
> 1) Do momentum and Nesterov momentum work equally well with and without mini-batching?
> Is there any guidance we need to give to users on this?
> Acceptance
> [1] Compare the usefulness of momentum with and without Nesterov, and SGD (i.e., 3 comparisons).  Use a 2D Rosenbrock function to compare in a similar way to test ref [100] in the comment further down, i.e., loss by iteration number.  Maybe try a few different 2D slices (starting points).
> [2] Test with MNIST.  Please generate characteristic curves of loss vs. iteration number, similar to what was done for Rosenbrock.
> [3] Report out momentum value and Nesterov in the output summary table.
> References
> [1] http://cs231n.github.io/neural-networks-3/#sgd
> [2] http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf, a link from previous source.
> [3] http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentoptimizationalgorithms
> [4] http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
> [5] https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
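
For readers who want to see the difference between classical and Nesterov momentum described in the interface notes above, here is a minimal Python sketch on the 2D Rosenbrock function named in the acceptance criteria. It is not the MADlib implementation. The function names (rosenbrock, rosenbrock_grad, descend), the starting point, and the step size of 1e-4 are assumptions chosen for illustration; the parameter names momentum, nesterov, learning_rate_init and n_iterations simply mirror the interface. The Nesterov variant shown is the common "look-ahead gradient" formulation, which may differ in detail from what the module does.

```python
import numpy as np


def rosenbrock(p):
    """2D Rosenbrock function; its global minimum of 0 is at (1, 1)."""
    x, y = p
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2


def rosenbrock_grad(p):
    """Analytic gradient of the 2D Rosenbrock function."""
    x, y = p
    return np.array([
        -2.0 * (1.0 - x) - 400.0 * x * (y - x ** 2),
        200.0 * (y - x ** 2),
    ])


def descend(start, learning_rate_init=1e-4, momentum=0.9, nesterov=True,
            n_iterations=2000):
    """Gradient descent with optional classical or Nesterov momentum.

    Hypothetical helper for illustration only. Returns the final point
    and the loss recorded after every iteration.
    """
    w = np.array(start, dtype=float)
    v = np.zeros_like(w)  # velocity
    losses = []
    # As in the interface notes, Nesterov only applies when momentum > 0.
    use_lookahead = nesterov and momentum > 0
    for _ in range(n_iterations):
        # Nesterov: evaluate the gradient at the look-ahead point w + momentum * v.
        # Classical momentum (or no momentum): evaluate it at the current point.
        point = w + momentum * v if use_lookahead else w
        g = rosenbrock_grad(point)
        v = momentum * v - learning_rate_init * g
        w = w + v
        losses.append(rosenbrock(w))
    return w, losses


start = (-1.0, 1.0)  # assumed starting point, not taken from the attachments
runs = {
    "no momentum": dict(momentum=0.0, nesterov=False),
    "classical momentum": dict(momentum=0.9, nesterov=False),
    "Nesterov momentum": dict(momentum=0.9, nesterov=True),
}
for label, params in runs.items():
    _, losses = descend(start, **params)
    print(f"{label}: loss after {len(losses)} iterations = {losses[-1]:.6f}")
```

Plotting each losses list against the iteration number gives loss-by-iteration curves of the kind requested in acceptance items [1] and [2]. Changing start gives a different 2D starting point.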


