Posted to dev@spark.apache.org by Eron Wright <ew...@live.com> on 2015/06/08 18:20:03 UTC

[sample code] deeplearning4j for Spark ML (@DeveloperAPI)

The deeplearning4j framework provides a variety of distributed, neural network-based learning algorithms, including convolutional nets, deep auto-encoders, deep-belief nets, and recurrent nets.  We’re working on integration with the Spark ML pipeline, leveraging the developer API.  This announcement shares some code and asks for feedback from the Spark community.

The integration code is located in the dl4j-spark-ml module <https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-scaleout/spark/dl4j-spark-ml> in the deeplearning4j repository.

Major aspects of the integration work:
1. ML algorithms.  To bind the dl4j algorithms to the ML pipeline, we developed a new classifier <https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala> and a new unsupervised learning estimator <https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/Unsupervised.scala>; a minimal pipeline sketch below shows the classifier in use.
2. ML attributes.  We strove to interoperate well with other pipeline components.  ML attributes are column-level metadata that let pipeline components share information.  See here <https://github.com/deeplearning4j/deeplearning4j/blob/4d33302dd8a792906050eda82a7d50ff77a8d957/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala#L89> how the classifier reads label metadata from a column produced by the new StringIndexer <http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer>; a short sketch of that lookup follows this list.
3. Large binary data.  Working with large binary data in Spark is challenging.  An effective approach is to leverage PrunedScan and to carefully control partition sizes.  Here <https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/sources/lfw/LfwRelation.scala> we explored this with a custom data source based on the new relation API; a condensed sketch of the pattern appears below.
4. Column-based record readers.  Here <https://github.com/deeplearning4j/deeplearning4j/blob/b237385b56d42d24bd3c99d1eece6cb658f387f2/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/sources/lfw/LfwRelation.scala#L96> we explored how to construct rows from a Hadoop input split by composing a number of column-level readers, with pruning support; the same sketch below illustrates the composition.
5. UDTs.  Spark SQL makes it possible to introduce new data types.  We prototyped an experimental Tensor type here <https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/types/tensors.scala>; a simplified UDT sketch appears below.
6. Spark Package.  We developed a Spark package to make it easy to use the dl4j framework in spark-shell and with spark-submit.  See the deeplearning4j/dl4j-spark-ml <https://github.com/deeplearning4j/dl4j-spark-ml> repository for useful snippets involving the sbt-spark-package plugin; a typical build setup is sketched below.
7. Example code.  The examples demonstrate how the standardized ML API simplifies interoperability, for instance label preprocessing and feature scaling.  See the deeplearning4j/dl4j-spark-ml-examples <https://github.com/deeplearning4j/dl4j-spark-ml-examples> repository for an expanding set of example pipelines.
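
To make the attribute mechanics concrete, here is a minimal sketch, against the Spark 1.4 ML attribute API, of how an estimator can recover the number of classes from the metadata that StringIndexer writes on the label column.  The numClasses helper is illustrative, not the actual dl4j code:

    import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
    import org.apache.spark.sql.types.StructField

    // Illustrative helper: read the class count that StringIndexer
    // recorded as nominal metadata on the label column.
    def numClasses(labelField: StructField): Int =
      Attribute.fromStructField(labelField) match {
        case nominal: NominalAttribute =>
          nominal.getNumValues.getOrElse(
            sys.error(s"Column ${labelField.name} has no cardinality metadata"))
        case _ =>
          sys.error(s"Column ${labelField.name} is not a nominal column")
      }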
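
The data-source pattern can be condensed as follows.  This is a toy sketch, not the LFW relation itself: a BaseRelation with PrunedScan assembles each Row by composing per-column readers, so a large binary column is never materialized unless the query selects it.  The ImageRecord type, the ImageRelation name, and the in-memory records are assumptions made for illustration; a real implementation would read Hadoop input splits and control partition sizes there:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
    import org.apache.spark.sql.types._

    // Toy record standing in for one entry of a Hadoop input split.
    case class ImageRecord(label: String, bytes: Array[Byte])

    class ImageRelation(records: Seq[ImageRecord])
                       (@transient val sqlContext: SQLContext)
      extends BaseRelation with PrunedScan {

      override def schema: StructType = StructType(Seq(
        StructField("label", StringType, nullable = false),
        StructField("image", BinaryType, nullable = false)))

      // One reader per column; only the readers for the requested
      // columns run, so the image bytes are skipped unless selected.
      private val readers: Map[String, ImageRecord => Any] = Map(
        "label" -> ((r: ImageRecord) => r.label),
        "image" -> ((r: ImageRecord) => r.bytes))

      override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
        val selected = requiredColumns.map(readers)
        sqlContext.sparkContext.parallelize(records).map { record =>
          Row.fromSeq(selected.map(_(record)))
        }
      }
    }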
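
To illustrate the UDT mechanism itself, here is a deliberately simplified sketch, not the prototyped Tensor type at the link above: a user-defined type maps a Scala class onto a Catalyst representation, in this case by flattening rank, shape, and values into a single array of doubles:

    import org.apache.spark.sql.types._

    // Simplified tensor: a shape plus flattened values.
    @SQLUserDefinedType(udt = classOf[TensorUDT])
    class Tensor(val shape: Array[Int], val values: Array[Double])
      extends Serializable

    class TensorUDT extends UserDefinedType[Tensor] {

      // Catalyst representation: rank, then the shape, then the values.
      override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

      override def serialize(obj: Any): Seq[Double] = obj match {
        case t: Tensor =>
          t.shape.length.toDouble +: (t.shape.map(_.toDouble) ++ t.values).toSeq
      }

      override def deserialize(datum: Any): Tensor = datum match {
        case xs: Seq[_] =>
          val ds = xs.map(_.asInstanceOf[Double])
          val rank = ds.head.toInt
          new Tensor(ds.slice(1, 1 + rank).map(_.toInt).toArray,
                     ds.drop(1 + rank).toArray)
      }

      override def userClass: Class[Tensor] = classOf[Tensor]
    }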
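
For readers who have not seen the sbt-spark-package plugin, a typical build setup looks roughly like the following.  The plugin version and resolver URL are what we would expect as of mid-2015; treat them as assumptions rather than pinned values:

    // project/plugins.sbt -- pull in the sbt-spark-package plugin.
    resolvers += "Spark Packages repo" at "https://dl.bintray.com/spark-packages/maven/"
    addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.3")

    // build.sbt -- name the package and the Spark components it builds against.
    spName := "deeplearning4j/dl4j-spark-ml"
    sparkVersion := "1.4.0"
    sparkComponents ++= Seq("mllib", "sql")

With that in place, users can pull the package into a session with something like spark-shell --packages deeplearning4j:dl4j-spark-ml:<version> (coordinates assumed; see the repository README for the published name).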
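
Finally, a minimal pipeline sketch showing the pieces together.  It uses the MultiLayerNetworkClassification classifier from item 1 and assumes it picks up the standard "label" and "features" columns by default, as stock Spark ML classifiers do; the column names text_label and rawFeatures are illustrative.  Take it as a shape-of-the-API sketch rather than a tuned example:

    import org.apache.spark.ml.{Pipeline, PipelineModel}
    import org.apache.spark.ml.feature.{StandardScaler, StringIndexer}
    import org.apache.spark.sql.DataFrame
    import org.deeplearning4j.spark.ml.classification.MultiLayerNetworkClassification

    // df must have a string "text_label" column and a vector "rawFeatures" column.
    def trainPipeline(df: DataFrame): PipelineModel = {
      // Index string labels into a metadata-bearing numeric "label" column.
      val indexer = new StringIndexer()
        .setInputCol("text_label")
        .setOutputCol("label")

      // Standardize features into the "features" column the classifier reads.
      val scaler = new StandardScaler()
        .setInputCol("rawFeatures")
        .setOutputCol("features")

      // Assumption: the dl4j classifier uses the standard "label"/"features"
      // column defaults, as stock Spark ML classifiers do.
      val classifier = new MultiLayerNetworkClassification()

      new Pipeline()
        .setStages(Array(indexer, scaler, classifier))
        .fit(df)
    }
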
Hope this proves useful to the community as we transition to exciting new concepts in Spark SQL and Spark ML.  Meanwhile, we have Spark working with multiple GPUs on AWS <http://deeplearning4j.org/gpu_aws.html>, and we're looking forward to optimizations that will speed neural net training even more.

Eron Wright
Contributor | deeplearning4j.org



Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Eron,

Please register your Spark Package on http://spark-packages.org; it
helps users find your work. Do you have any performance benchmarks to
share? Thanks!

Best,
Xiangrui

On Wed, Jun 10, 2015 at 10:48 PM, Nick Pentreath
<ni...@gmail.com> wrote:
> Looks very interesting, thanks for sharing this.
>
> I haven't had much chance to do more than a quick glance over the code.
> Quick question - are the Word2Vec and GloVe implementations fully parallel
> on Spark?
>
> On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright <ew...@live.com> wrote:
>> [original announcement snipped]



Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)

Posted by Nick Pentreath <ni...@gmail.com>.
Looks very interesting, thanks for sharing this.

I haven't had much chance to do more than a quick glance over the code.
Quick question - are the Word2Vec and GloVe implementations fully parallel
on Spark?

On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright <ew...@live.com> wrote:

> [original announcement snipped]