
Posted to reviews@spark.apache.org by avulanov <gi...@git.apache.org> on 2016/06/12 01:31:51 UTC

[GitHub] spark pull request #13621: [SPARK-2623] [ML] Implement stacked autoencoder

GitHub user avulanov opened a pull request:

    https://github.com/apache/spark/pull/13621

    [SPARK-2623] [ML] Implement stacked autoencoder

    ## What changes were proposed in this pull request?
    Implement stacked autoencoder
    - Base on ml.ann Layer and LossFunction
    - Implement two loss functions `EmptyLayerWithSquaredError` and `SigmoidLayerWithSquaredError` to handle inputs (-inf, +inf) and [0, 1]
    - Implement greedy training
    - Provide encoder and decoder
    
    ## How was this patch tested?
    Provide unit tests
    - Gradient correctness of the new LossFunctions
    - Correct reconstruction of the original data by encoding and decoding (based on Berkeley's CS182)
    - Successful pre-training of deep network with 6 hidden layers
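
The gradient-correctness test mentioned above can be illustrated with a finite-difference check. This is a minimal sketch in plain Python, not the patch's Scala code: the function names, the sample values, and the 0.5 factor in the loss are assumptions made for illustration. It compares the analytic gradient of a squared-error loss with respect to the pre-activation, for an identity output (unbounded targets) and for a sigmoid output (targets in [0, 1]), against central differences.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def squared_error(z, target, activation):
    # 0.5 * sum((f(z_i) - t_i)^2) for one sample; the 0.5 is an assumed convention.
    return 0.5 * sum((activation(zi) - ti) ** 2 for zi, ti in zip(z, target))


def grad_identity(z, target):
    # Identity output: dL/dz_i = z_i - t_i.
    return [zi - ti for zi, ti in zip(z, target)]


def grad_sigmoid(z, target):
    # Sigmoid output: dL/dz_i = (a_i - t_i) * a_i * (1 - a_i), with a_i = sigmoid(z_i).
    grads = []
    for zi, ti in zip(z, target):
        a = sigmoid(zi)
        grads.append((a - ti) * a * (1.0 - a))
    return grads


def numerical_grad(z, target, activation, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grads = []
    for i in range(len(z)):
        plus = list(z)
        minus = list(z)
        plus[i] += eps
        minus[i] -= eps
        diff = squared_error(plus, target, activation) - squared_error(minus, target, activation)
        grads.append(diff / (2.0 * eps))
    return grads


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


z = [0.3, -1.2, 2.0]            # hypothetical pre-activations
real_target = [1.5, -0.7, 4.0]  # unbounded targets, identity output
unit_target = [0.9, 0.1, 0.5]   # targets in [0, 1], sigmoid output

err_identity = max_abs_diff(grad_identity(z, real_target),
                            numerical_grad(z, real_target, identity))
err_sigmoid = max_abs_diff(grad_sigmoid(z, unit_target),
                           numerical_grad(z, unit_target, sigmoid))
print(err_identity, err_sigmoid)
```

Both differences should be tiny. A large value would point to a mistake in the analytic gradient.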

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/avulanov/spark autoencoder-mlp

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13621.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13621
    
----
commit adc81ba1f1b6fb014bb1813de3ab283f841585d5
Author: avulanov <av...@gmail.com>
Date:   2016-04-04T23:06:25Z

    Implement stacked autoencoder
    - Base on ml.ann Layer and LossFunction
    - Implement two new loss functions EmptyLayerWithSquaredError and SigmoidLayerWithSquaredError to handle inputs (-inf, +inf) and [0, 1]
    - Implement greedy training
    - Provide encoder and decoder

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    Added this feature to the Spark [scalable-deeplearning package](https://spark-packages.org/package/avulanov/scalable-deeplearning). @sethah Could you take a look? Also, it would be great to add ReLU as you suggested. This package is intended for new features that have not yet been merged into Spark ML or that are too experimental to be merged.



[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the issue:

https://github.com/apache/spark/pull/13621

@sethah Thank you for posting the results of your experiment! They look interesting. It is hard to say how well it works without numerical results for a particular application, such as the classification error rate. Could you compute the classification error rate on the MNIST test data with and without autoencoder pre-training in Spark? I did this a while ago for a network with two hidden layers of 300 and 100 neurons. Autoencoder pre-training improved over standard training and reached the error rate reported at http://yann.lecun.com/exdb/mnist/. The other useful application of autoencoders is unsupervised learning. In this case, it would be interesting to compare the losses of sigmoid and ReLU autoencoders on the validation set. Would you mind checking this?

Autoencoders are also used to pre-train deep networks that would otherwise not converge because of the vanishing gradient problem. There is an example of this use case in the unit test.
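
The vanishing gradient problem mentioned above can be made concrete with a small calculation. This is a simplified sketch in plain Python under a stated assumption: it tracks only the product of sigmoid derivatives along one path through six hidden layers and ignores the weight matrices, which also scale the gradient in a real network. The function names and pre-activation values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    # Derivative of the sigmoid; its maximum is 0.25, reached at z = 0.
    a = sigmoid(z)
    return a * (1.0 - a)


def activation_factor(pre_activations):
    # Product of sigmoid derivatives along one path through the layers.
    factor = 1.0
    for z in pre_activations:
        factor *= sigmoid_prime(z)
    return factor


DEPTH = 6  # six hidden layers, as in the unit test described in the pull request

best_case = activation_factor([0.0] * DEPTH)  # every unit at its steepest point
off_center = activation_factor([2.0] * DEPTH)  # every unit moderately activated

print(f"best case:  {best_case:.3e}")
print(f"off center: {off_center:.3e}")
```

Even in the best case the factor is 0.25 to the sixth power, so the gradient reaching the first layer is thousands of times smaller than at the output. Greedy layer-wise pre-training sidesteps this by training each layer as its own shallow autoencoder first.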


[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    **[Test build #60350 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60350/consoleFull)** for PR 13621 at commit [`adc81ba`](https://github.com/apache/spark/commit/adc81ba1f1b6fb014bb1813de3ab283f841585d5).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class StackedAutoencoder (override val uid: String)`



[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    **[Test build #60350 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60350/consoleFull)** for PR 13621 at commit [`adc81ba`](https://github.com/apache/spark/commit/adc81ba1f1b6fb014bb1813de3ab283f841585d5).



[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    **[Test build #60351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60351/consoleFull)** for PR 13621 at commit [`b3f5539`](https://github.com/apache/spark/commit/b3f5539b45f86309b17394a9f7ba88d82dcd124f).



[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    **[Test build #60351 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60351/consoleFull)** for PR 13621 at commit [`b3f5539`](https://github.com/apache/spark/commit/b3f5539b45f86309b17394a9f7ba88d82dcd124f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13621: [SPARK-10408] [ML] Implement stacked autoencoder

Posted by JeremyNixon <gi...@git.apache.org>.

Github user JeremyNixon commented on the issue:

https://github.com/apache/spark/pull/13621

I ran the Keras experiment with code up at [GitHub](https://github.com/JeremyNixon/autoencoder) if anyone wants to build on this or replicate it.

Running Seth’s example on the training data set, I was able to get the results below.

![screen shot 2017-05-11 at 10 08 37 pm](https://cloud.githubusercontent.com/assets/4738024/25979615/9567a8bc-3697-11e7-81ed-be3fd073f4c5.png)

I agree that we should add modern activation functions. More importantly, we should add improved optimizers and a modular API to make this valuable to real users.

I’m going to do a code review here and at scalable-deeplearning in the next few days regardless of the decision we make around this. I think that these improvements (activation functions, optimizers) should be a part of a flexible modular library if we want to give users a modern experience.


[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60351/
    Test PASSed.



[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the issue:

https://github.com/apache/spark/pull/13621

@avulanov

I used this implementation to run a simple single-layer autoencoder on the MNIST dataset. I also used Keras/Theano to implement the same autoencoder and ran it on the MNIST data. With Spark, I got very poor results. First, here are the results of encode/decode using Keras with a cross-entropy loss function on the output and sigmoid activations.

![image](https://cloud.githubusercontent.com/assets/7275795/17375073/59b14faa-5964-11e6-943f-d2e1db06089d.png)

The implementation in this patch yielded very similar results.

![image](https://cloud.githubusercontent.com/assets/7275795/17374543/fb923c42-5961-11e6-8c97-dfa7626c4cc3.png)

Finally, here is the Keras implementation using ReLU activations.

![image](https://cloud.githubusercontent.com/assets/7275795/17375464/ebe1b8d2-5965-11e6-964f-fa8cc1c2a4f5.png)

It appears the sigmoid activations are saturating during training and preventing the algorithm from learning. If you have any thoughts/suggestions to improve these results I'd really appreciate it.

Does it make sense to add another algorithm based on MLP/NN when the current functionality is so limited? If the autoencoder library is not useful without more than sigmoid activations, I'd vote for focusing on adding new activations before another algorithm. I'm not an expert here, so I would really appreciate your thoughts. Thanks!
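
The saturation effect suspected above can be shown with a short comparison of activation slopes. This is an illustrative sketch in plain Python with hypothetical pre-activation values. It is not taken from the Spark patch or from the Keras experiment.

```python
import math


def sigmoid_slope(z):
    # Derivative of the sigmoid at z.
    a = 1.0 / (1.0 + math.exp(-z))
    return a * (1.0 - a)


def relu_slope(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0


pre_activations = [0.5, 2.0, 5.0, 10.0]  # hypothetical values
slopes = [(z, sigmoid_slope(z), relu_slope(z)) for z in pre_activations]

for z, s, r in slopes:
    print(f"z={z:5.1f}  sigmoid'={s:.6f}  relu'={r:.1f}")
```

As the pre-activation grows, the sigmoid slope shrinks toward zero, so weight updates for a saturated unit become negligible. The ReLU slope stays at 1 for any positive input.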


[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    @mengxr @jkbradley could you take a look?



[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    cc @JeremyNixon also



[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    I realize now that I was a bit unclear. The results above are from training a single-layer autoencoder and using it to reconstruct the original data. I used an encoding layer of 32 neurons, so the results above are generated by (1) encoding the 784-dimensional input into a 32-dimensional code and (2) decoding the 32-dimensional vector back to 784 dimensions. I will try to get some specific numbers and do pre-training. For now, I wanted to point out that we get poor performance with sigmoid units and to discuss where the short-term focus for deep learning in Spark should be.
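
The 784 -> 32 -> 784 shape described above can be sketched in plain Python. This is an illustration only: the weights are random and untrained, the helper names are hypothetical, and a sigmoid on both layers is an assumption. It shows the dimensions at each stage, the parameter count, and how a mean squared reconstruction error would be computed.

```python
import math
import random

INPUT_DIM = 784  # 28 x 28 MNIST pixels
CODE_DIM = 32    # size of the encoding layer


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def random_layer(n_out, n_in, rng):
    # Small random weights and zero biases; untrained, for shape illustration only.
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def dense(weights, bias, x):
    # One fully connected layer followed by a sigmoid.
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


rng = random.Random(0)
enc_w, enc_b = random_layer(CODE_DIM, INPUT_DIM, rng)
dec_w, dec_b = random_layer(INPUT_DIM, CODE_DIM, rng)

pixels = [rng.random() for _ in range(INPUT_DIM)]  # stand-in for one image in [0, 1]
code = dense(enc_w, enc_b, pixels)                 # 784 -> 32
reconstruction = dense(dec_w, dec_b, code)         # 32 -> 784

n_params = (INPUT_DIM * CODE_DIM + CODE_DIM) + (CODE_DIM * INPUT_DIM + INPUT_DIM)
mse = sum((r - p) ** 2 for r, p in zip(reconstruction, pixels)) / INPUT_DIM

print(len(code), len(reconstruction), n_params)
```

With untrained weights the reconstruction is meaningless. Training would adjust the weights to reduce `mse` over the dataset.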



[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60350/
    Test FAILed.

