Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2018/10/22 18:42:00 UTC

[jira] [Updated] (MADLIB-1268) Spike - CNN convergence, data parallel with merge

     [ https://issues.apache.org/jira/browse/MADLIB-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1268:
------------------------------------
    Fix Version/s:     (was: v2.0)
                   v1.16

> Spike - CNN convergence, data parallel with merge
> -------------------------------------------------
>
>                 Key: MADLIB-1268
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1268
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Deep Learning
>            Reporter: Frank McQuillan
>            Assignee: Frank McQuillan
>            Priority: Major
>             Fix For: v1.16
>
>
> Story
> `As a MADlib developer`
> I want to investigate convergence behaviour when running a single distributed CNN model across the Greenplum cluster using Keras with a TensorFlow backend
> `so that`
> I can see whether it converges in a predictable and expected way.
> Details
> * By "single distributed CNN model" I mean data parallel with merge (not model parallel).
> * Does not need to use an aggregate for this spike, if that is too inconvenient, since performance is not the focus of this story.  It's about convergence.
> * In defining the merge function, review [2] for single-server, multi-GPU merge function.  Perhaps we can do the exact same thing for multi-server?
> * For dataset, consider MNIST and/or CIFAR-10.
> * See page 11 of [8] re synchronous data parallel in TF
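>
> A minimal sketch of what the merge could look like, assuming each segment trains a Keras model replica on its local data shard and the merge simply averages per-layer weights (optionally weighted by shard size).  The function name and the weighted-averaging scheme are illustrative assumptions for this spike, not an agreed design:
>
>     import numpy as np
>
>     def merge_weights(segment_weights, segment_row_counts=None):
>         """Average per-layer weights from N model replicas.
>
>         segment_weights: one entry per segment, each entry being the list
>             returned by keras.Model.get_weights().
>         segment_row_counts: optional per-segment example counts, used to
>             weight the average by shard size.
>         """
>         n = len(segment_weights)
>         if segment_row_counts is None:
>             coeffs = [1.0 / n] * n
>         else:
>             total = float(sum(segment_row_counts))
>             coeffs = [c / total for c in segment_row_counts]
>         merged = []
>         for layer_idx in range(len(segment_weights[0])):
>             layer = sum(coeffs[i] * np.asarray(segment_weights[i][layer_idx])
>                         for i in range(n))
>             merged.append(layer)
>         return merged
>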
> Acceptance
> 1) Plot characteristic curves of loss vs. iteration number.  Compare training with the MADlib merge (this story) vs. without it.
> 2) Define what the merge function is for a CNN.  Is it the same as [2] or something else?  Does it operate on weights only, or does it need gradients?
> 3) What does the architecture look like?  Draw a diagram showing the sync/merge step for distributed model training.  (A rough driver-loop sketch follows this list.)
> 4) What tests do we need to run to convince ourselves that the architecture is valid?
> 5) Do we need to write a different merge function, or take a different approach, for each type of neural network algorithm, or is there a general approach that applies to this whole class of algorithms?
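>
> To make the sync/merge step concrete for the diagram in item 3, here is a rough driver loop under the same assumptions (one Keras replica per segment, one local pass per iteration, weights averaged and redistributed between passes), reusing the merge_weights sketch above.  build_model, the serial simulation of segments, and the loss bookkeeping are hypothetical placeholders, not the actual MADlib architecture:
>
>     def train_data_parallel(build_model, segments, num_iterations):
>         """Synchronous data-parallel training with a merge after each pass.
>
>         build_model: zero-argument function returning a compiled Keras model.
>         segments: list of (x_shard, y_shard) training shards, one per segment.
>         """
>         model = build_model()
>         weights = model.get_weights()
>         loss_history = []
>         for it in range(num_iterations):
>             per_segment_weights, per_segment_rows = [], []
>             for x_shard, y_shard in segments:
>                 # Each "segment" starts from the merged weights and trains
>                 # one pass on its local shard (simulated serially here).
>                 model.set_weights(weights)
>                 model.fit(x_shard, y_shard, epochs=1, verbose=0)
>                 per_segment_weights.append(model.get_weights())
>                 per_segment_rows.append(len(x_shard))
>             # Sync/merge step: average the replicas and redistribute.
>             weights = merge_weights(per_segment_weights, per_segment_rows)
>             model.set_weights(weights)
>             # Loss on one shard, for the loss-vs-iteration plot in item 1.
>             loss_history.append(model.evaluate(*segments[0], verbose=0))
>         return model, loss_history
>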
> References
> [2] Check the “# Merge outputs under expected scope” section in the Python program
>  https://github.com/keras-team/keras/blob/bf1378f39d02b7d0b53ece5458f9275ac8208046/keras/utils/multi_gpu_utils.py
> [5] Single Machine Data Parallel multi GPU Training 
> https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
> [6] Why are GPUs necessary for training Deep Learning models?
> https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-deep-learning/
> [7] Deep Learning vs Classical Machine Learning
> https://towardsdatascience.com/deep-learning-vs-classical-machine-learning-9a42c6d48aa
> [8] TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
> https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)