Posted to dev@opennlp.apache.org by Thamme Gowda <tg...@gmail.com> on 2017/07/05 01:48:39 UTC

Document Categorizer based on GloVe + LSTM (powered by DL4J)

Hello OpenNLP Devs,

I am working on text classification using word embeddings such as
GloVe/Word2Vec together with LSTM networks.
It would be interesting to see if we can use this as a document categorizer,
especially for sentiment analysis in OpenNLP.

I have already raised a PR to the sandbox repo -
https://github.com/apache/opennlp-sandbox/pull/3

This is the first version, and I expect to receive feedback from the dev
community to make it work for everyone.

Here are the design choices I have made for the initial version:

   - Using pre-trained GloVe vectors - I felt the GloVe vector format is
   clean and easily customizable in terms of dimensions and vocabulary size
   (and I have been reading a lot about them from the Stanford NLP group).
      - Training GloVe vectors isn't hard either; we can do it with the
      original C library as well as with DL4J.
   - Using DL4J's multi-layer networks with LSTM instead of reinventing this
   stuff on the JVM for OpenNLP (see the sketch below).
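
As a rough illustration of the second choice, the DL4J wiring could look
something like this (a minimal sketch, not the code in the PR; the GloVe
file name, layer sizes, and class name are placeholder assumptions):

    import java.io.File;

    import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
    import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;
    import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.layers.GravesLSTM;
    import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.lossfunctions.LossFunctions;

    public class GloveLstmSketch {

        /** One LSTM layer over per-token GloVe vectors, softmax over categories. */
        public static MultiLayerNetwork buildNetwork(int vectorSize, int nCategories) {
            MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .seed(42)
                .list()
                .layer(0, new GravesLSTM.Builder()
                    .nIn(vectorSize).nOut(256)
                    .activation(Activation.TANH)
                    .build())
                .layer(1, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                    .activation(Activation.SOFTMAX)
                    .nIn(256).nOut(nCategories)
                    .build())
                .build();
            MultiLayerNetwork net = new MultiLayerNetwork(conf);
            net.init();
            return net;
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical path to pre-trained GloVe vectors in the plain text format.
            WordVectors glove =
                WordVectorSerializer.loadTxtVectors(new File("glove.6B.100d.txt"));
            MultiLayerNetwork net = buildNetwork(100, 2);  // 100-d vectors, pos/neg
            System.out.println("has 'movie': " + glove.hasWord("movie"));
            System.out.println("parameters: " + net.numParams());
        }
    }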


Please share your feedback here or on the GitHub page:
https://github.com/apache/opennlp-sandbox/pull/3


Thanks,
TG


--
*Thamme Gowda *
@thammegowda <https://twitter.com/thammegowda> |
http://scf.usc.edu/~tnarayan/
~Sent via somebody's Webmail server

Re: Document Categorizer based on GloVe + LSTM (powered by DL4J)

Posted by Joern Kottmann <ko...@gmail.com>.
It would be really great if you could implement doccat format support
for the Stanford Large Movie Review dataset; that way we can also
easily train the normal doccat component with it. We should open a
JIRA issue for that.
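
A reader along these lines might be enough for the aclImdb train/pos and
train/neg layout (an illustrative sketch only; the class name and the
whitespace tokenization are assumptions, not an agreed design):

    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import opennlp.tools.doccat.DocumentSample;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.ObjectStream;

    /** Emits one DocumentSample per review from the pos/ and neg/ directories. */
    public class LargeMovieReviewSampleStream implements ObjectStream<DocumentSample> {

        private final List<File> files = new ArrayList<>();
        private Iterator<File> it;

        public LargeMovieReviewSampleStream(File imdbDir) {
            for (String label : new String[] {"pos", "neg"}) {
                File[] reviews = new File(imdbDir, label).listFiles();
                if (reviews != null) {
                    for (File f : reviews) {
                        files.add(f);
                    }
                }
            }
            it = files.iterator();
        }

        @Override
        public DocumentSample read() throws IOException {
            if (!it.hasNext()) {
                return null;  // end of stream
            }
            File f = it.next();
            String category = f.getParentFile().getName();  // "pos" or "neg"
            String text = new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
            return new DocumentSample(category, WhitespaceTokenizer.INSTANCE.tokenize(text));
        }

        @Override
        public void reset() {
            it = files.iterator();
        }

        @Override
        public void close() {
            // nothing to release
        }
    }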

Jörn

On Wed, Jul 5, 2017 at 7:29 PM, Thamme Gowda <tg...@gmail.com> wrote:
> Got it, Thanks. We will do it.
>
> On Jul 5, 2017 9:43 AM, "Chris Mattmann" <ma...@apache.org> wrote:
>
> Thanks Thamme.
>
> Please train on the datasets for sentiment analysis described here so we
> can align
> with the standard DocCat training I’m doing for sentiment analysis post
> 1.8.1.
>
> http://irds.usc.edu/SentimentAnalysisParser/datasets.html
>
> Thanks!
>
> Cheers,
> Chris
>
>
>
>
> On 7/5/17, 9:34 AM, "Thamme Gowda" <th...@apache.org> wrote:
>
>     @Tomasso  @Jörn
>     Thanks. I will update the PR by making it implement Doccat API.
>
>     @Rodrigo
>     I have not yet tested on the full Stanford Large Movie Review dataset.
> It
>     takes more time to train, perhaps a few days for multiple passes on the
>     entire dataset (on my i5 CPU, no GPUs at the moment).
>     I had trained models (multiple times) with 3000 examples (1500 pos, 1500
>     neg)  for two epochs, the F1 was approximately 0.70.
>     I plan to train on the complete dataset sometime down the line and tune
> the
>     network with more layers (that is the fun part). This PR is like
> setting up
>     the infrastructure for it.
>
>     @Chris
>     Hi Prof. Thanks for the kind words! Just getting started with my new job
>     here - more NLP and Machine Translation stuff to come.
>
>     -Thamme
>
>     On Wed, Jul 5, 2017 at 8:26 AM, Chris Mattmann <ma...@apache.org>
> wrote:
>
>     > Thamme, great job!
>     >
>     > (proud academic dad)
>     >
>     > Cheers,
>     > Chris
>     >
>     >
>     >
>     >
>     > On 7/5/17, 12:31 AM, "Joern Kottmann" <ko...@gmail.com> wrote:
>     >
>     >     +1 to merge this when it implements the Document Categorizer,
> then we
>     >     can also use those tools to train and evaluate it
>     >
>     >     Jörn
>     >
>     >     On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri <ragerri@apache.org
>>
>     > wrote:
>     >     > Hello again,
>     >     >
>     >     > @Thamme, out of curiosity, do you have evaluation numbers on the
>     >     > Stanford Large Movie Review dataset?
>     >     >
>     >     > Best,
>     >     >
>     >     > Rodrigo
>     >     >
>     >     > On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <
> ragerri@apache.org>
>     > wrote:
>     >     >> +1 to Tommaso's comment. This would be very nice to have in the
>     > project.
>     >     >>
>     >     >> R
>     >     >>
>     >     >> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
>     >     >> <to...@gmail.com> wrote:
>     >     >>> thanks Thamme for bringing this to the list!
>     >     >>>
>     >     >>>
>     >     >>> Il giorno mer 5 lug 2017 alle ore 03:49 Thamme Gowda <
>     > tgowdan@gmail.com> ha
>     >     >>> scritto:
>     >     >>>
>     >     >>>> Hello OpenNLP Devs,
>     >     >>>>
>     >     >>>> I am working with text classification using word embeddings
> like
>     >     >>>> Gloves/Word2Vec and LSTM networks.
>     >     >>>> It will be interesting to see if we can use it as document
>     > categorizer,
>     >     >>>> especially for sentiment analysis in OpenNLP.
>     >     >>>>
>     >     >>>> I have already raised a PR to the sandbox repo -
>     >     >>>> https://github.com/apache/opennlp-sandbox/pull/3
>     >     >>>>
>     >     >>>> This is first version, and I expect to receive feedback from
> Dev
>     > community
>     >     >>>> to make it work for everyone.
>     >     >>>>
>     >     >>>> Here are the design choices I have made for the initial
> version:
>     >     >>>>
>     >     >>>>    - Using pre-trained Gloves - I felt the glove vector
> format is
>     > clean,
>     >     >>>>    easily customizable in terms of dimensions and vocabulary
>     > size, and
>     >     >>>> (also I
>     >     >>>>    have been reading a lot about them from Stanford NLP
> group).
>     >     >>>>       - Training Gloves isnt hard either, we can do it using
> the
>     > original C
>     >     >>>>       library as well as by using DL4J.
>     >     >>>>       - Using DL4J's Multi layer networks with LSTM instead
> of
>     > reinventing
>     >     >>>>    this stuff again on JVM for OpenNLP
>     >     >>>>
>     >     >>>>
>     >     >>>> Please share your feedback here or on the github page
>     >     >>>> https://github.com/apache/opennlp-sandbox/pull/3 .
>     >     >>>>
>     >     >>>>
>     >     >>> I think the approach outlined here sounds good, I think we
> could
>     >     >>> incorporate the PR as soon as it implements the Doccat API.
>     >     >>> Then we may see whether and how it makes sense to adjust it
> to use
>     > other
>     >     >>> types of embeddings (e.g. paragraph vectors) and / or
> different
>     > network
>     >     >>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
>     >     >>>
>     >     >>> Looking forward to see this move forward,
>     >     >>> Regards,
>     >     >>> Tommaso
>     >     >>>
>     >     >>>
>     >     >>>>
>     >     >>>> Thanks,
>     >     >>>> TG
>     >     >>>>
>     >     >>>>
>     >     >>>> --
>     >     >>>> *Thamme Gowda *
>     >     >>>> @thammegowda <https://twitter.com/thammegowda> |
>     >     >>>> http://scf.usc.edu/~tnarayan/
>     >     >>>> ~Sent via somebody's Webmail server
>     >     >>>>
>     >
>     >
>     >
>     >

Re: Document Categorizer based on GloVe + LSTM (powered by DL4J)

Posted by Thamme Gowda <tg...@gmail.com>.
Got it, thanks. We will do it.

On Jul 5, 2017 9:43 AM, "Chris Mattmann" <ma...@apache.org> wrote:

Thanks Thamme.

Please train on the datasets for sentiment analysis described here so we
can align
with the standard DocCat training I’m doing for sentiment analysis post
1.8.1.

http://irds.usc.edu/SentimentAnalysisParser/datasets.html

Thanks!

Cheers,
Chris




On 7/5/17, 9:34 AM, "Thamme Gowda" <th...@apache.org> wrote:

    @Tomasso  @Jörn
    Thanks. I will update the PR by making it implement Doccat API.

    @Rodrigo
    I have not yet tested on the full Stanford Large Movie Review dataset.
It
    takes more time to train, perhaps a few days for multiple passes on the
    entire dataset (on my i5 CPU, no GPUs at the moment).
    I had trained models (multiple times) with 3000 examples (1500 pos, 1500
    neg)  for two epochs, the F1 was approximately 0.70.
    I plan to train on the complete dataset sometime down the line and tune
the
    network with more layers (that is the fun part). This PR is like
setting up
    the infrastructure for it.

    @Chris
    Hi Prof. Thanks for the kind words! Just getting started with my new job
    here - more NLP and Machine Translation stuff to come.

    -Thamme

    On Wed, Jul 5, 2017 at 8:26 AM, Chris Mattmann <ma...@apache.org>
wrote:

    > Thamme, great job!
    >
    > (proud academic dad)
    >
    > Cheers,
    > Chris
    >
    >
    >
    >
    > On 7/5/17, 12:31 AM, "Joern Kottmann" <ko...@gmail.com> wrote:
    >
    >     +1 to merge this when it implements the Document Categorizer,
then we
    >     can also use those tools to train and evaluate it
    >
    >     Jörn
    >
    >     On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri <ragerri@apache.org
>
    > wrote:
    >     > Hello again,
    >     >
    >     > @Thamme, out of curiosity, do you have evaluation numbers on the
    >     > Stanford Large Movie Review dataset?
    >     >
    >     > Best,
    >     >
    >     > Rodrigo
    >     >
    >     > On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <
ragerri@apache.org>
    > wrote:
    >     >> +1 to Tommaso's comment. This would be very nice to have in the
    > project.
    >     >>
    >     >> R
    >     >>
    >     >> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
    >     >> <to...@gmail.com> wrote:
    >     >>> thanks Thamme for bringing this to the list!
    >     >>>
    >     >>>
    >     >>> Il giorno mer 5 lug 2017 alle ore 03:49 Thamme Gowda <
    > tgowdan@gmail.com> ha
    >     >>> scritto:
    >     >>>
    >     >>>> Hello OpenNLP Devs,
    >     >>>>
    >     >>>> I am working with text classification using word embeddings
like
    >     >>>> Gloves/Word2Vec and LSTM networks.
    >     >>>> It will be interesting to see if we can use it as document
    > categorizer,
    >     >>>> especially for sentiment analysis in OpenNLP.
    >     >>>>
    >     >>>> I have already raised a PR to the sandbox repo -
    >     >>>> https://github.com/apache/opennlp-sandbox/pull/3
    >     >>>>
    >     >>>> This is first version, and I expect to receive feedback from
Dev
    > community
    >     >>>> to make it work for everyone.
    >     >>>>
    >     >>>> Here are the design choices I have made for the initial
version:
    >     >>>>
    >     >>>>    - Using pre-trained Gloves - I felt the glove vector
format is
    > clean,
    >     >>>>    easily customizable in terms of dimensions and vocabulary
    > size, and
    >     >>>> (also I
    >     >>>>    have been reading a lot about them from Stanford NLP
group).
    >     >>>>       - Training Gloves isnt hard either, we can do it using
the
    > original C
    >     >>>>       library as well as by using DL4J.
    >     >>>>       - Using DL4J's Multi layer networks with LSTM instead
of
    > reinventing
    >     >>>>    this stuff again on JVM for OpenNLP
    >     >>>>
    >     >>>>
    >     >>>> Please share your feedback here or on the github page
    >     >>>> https://github.com/apache/opennlp-sandbox/pull/3 .
    >     >>>>
    >     >>>>
    >     >>> I think the approach outlined here sounds good, I think we
could
    >     >>> incorporate the PR as soon as it implements the Doccat API.
    >     >>> Then we may see whether and how it makes sense to adjust it
to use
    > other
    >     >>> types of embeddings (e.g. paragraph vectors) and / or
different
    > network
    >     >>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
    >     >>>
    >     >>> Looking forward to see this move forward,
    >     >>> Regards,
    >     >>> Tommaso
    >     >>>
    >     >>>
    >     >>>>
    >     >>>> Thanks,
    >     >>>> TG
    >     >>>>
    >     >>>>
    >     >>>> --
    >     >>>> *Thamme Gowda *
    >     >>>> @thammegowda <https://twitter.com/thammegowda> |
    >     >>>> http://scf.usc.edu/~tnarayan/
    >     >>>> ~Sent via somebody's Webmail server
    >     >>>>
    >
    >
    >
    >

Re: Document Categorizer based on GloVe + LSTM (powered by DL4J)

Posted by Chris Mattmann <ma...@apache.org>.
Thanks Thamme.

Please train on the datasets for sentiment analysis described here so we can align
with the standard DocCat training I’m doing for sentiment analysis post 1.8.1.

http://irds.usc.edu/SentimentAnalysisParser/datasets.html 
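
For the standard DocCat side, training on that data would presumably look
roughly like the sketch below (the reader class is the hypothetical one
sketched earlier in the thread; file names and parameters are placeholders):

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import opennlp.tools.doccat.DoccatFactory;
    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerME;
    import opennlp.tools.doccat.DocumentSample;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainDoccatSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical reader over the aclImdb training split.
            ObjectStream<DocumentSample> samples =
                new LargeMovieReviewSampleStream(new File("aclImdb/train"));
            DoccatModel model = DocumentCategorizerME.train(
                "en", samples, TrainingParameters.defaultParams(), new DoccatFactory());
            samples.close();
            // Placeholder output file name.
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("en-sentiment-doccat.bin"))) {
                model.serialize(out);
            }
        }
    }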

Thanks!

Cheers,
Chris




On 7/5/17, 9:34 AM, "Thamme Gowda" <th...@apache.org> wrote:

    @Tomasso  @Jörn
    Thanks. I will update the PR by making it implement Doccat API.
    
    @Rodrigo
    I have not yet tested on the full Stanford Large Movie Review dataset. It
    takes more time to train, perhaps a few days for multiple passes on the
    entire dataset (on my i5 CPU, no GPUs at the moment).
    I had trained models (multiple times) with 3000 examples (1500 pos, 1500
    neg)  for two epochs, the F1 was approximately 0.70.
    I plan to train on the complete dataset sometime down the line and tune the
    network with more layers (that is the fun part). This PR is like setting up
    the infrastructure for it.
    
    @Chris
    Hi Prof. Thanks for the kind words! Just getting started with my new job
    here - more NLP and Machine Translation stuff to come.
    
    -Thamme
    
    On Wed, Jul 5, 2017 at 8:26 AM, Chris Mattmann <ma...@apache.org> wrote:
    
    > Thamme, great job!
    >
    > (proud academic dad)
    >
    > Cheers,
    > Chris
    >
    >
    >
    >
    > On 7/5/17, 12:31 AM, "Joern Kottmann" <ko...@gmail.com> wrote:
    >
    >     +1 to merge this when it implements the Document Categorizer, then we
    >     can also use those tools to train and evaluate it
    >
    >     Jörn
    >
    >     On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri <ra...@apache.org>
    > wrote:
    >     > Hello again,
    >     >
    >     > @Thamme, out of curiosity, do you have evaluation numbers on the
    >     > Stanford Large Movie Review dataset?
    >     >
    >     > Best,
    >     >
    >     > Rodrigo
    >     >
    >     > On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <ra...@apache.org>
    > wrote:
    >     >> +1 to Tommaso's comment. This would be very nice to have in the
    > project.
    >     >>
    >     >> R
    >     >>
    >     >> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
    >     >> <to...@gmail.com> wrote:
    >     >>> thanks Thamme for bringing this to the list!
    >     >>>
    >     >>>
    >     >>> Il giorno mer 5 lug 2017 alle ore 03:49 Thamme Gowda <
    > tgowdan@gmail.com> ha
    >     >>> scritto:
    >     >>>
    >     >>>> Hello OpenNLP Devs,
    >     >>>>
    >     >>>> I am working with text classification using word embeddings like
    >     >>>> Gloves/Word2Vec and LSTM networks.
    >     >>>> It will be interesting to see if we can use it as document
    > categorizer,
    >     >>>> especially for sentiment analysis in OpenNLP.
    >     >>>>
    >     >>>> I have already raised a PR to the sandbox repo -
    >     >>>> https://github.com/apache/opennlp-sandbox/pull/3
    >     >>>>
    >     >>>> This is first version, and I expect to receive feedback from Dev
    > community
    >     >>>> to make it work for everyone.
    >     >>>>
    >     >>>> Here are the design choices I have made for the initial version:
    >     >>>>
    >     >>>>    - Using pre-trained Gloves - I felt the glove vector format is
    > clean,
    >     >>>>    easily customizable in terms of dimensions and vocabulary
    > size, and
    >     >>>> (also I
    >     >>>>    have been reading a lot about them from Stanford NLP group).
    >     >>>>       - Training Gloves isnt hard either, we can do it using the
    > original C
    >     >>>>       library as well as by using DL4J.
    >     >>>>       - Using DL4J's Multi layer networks with LSTM instead of
    > reinventing
    >     >>>>    this stuff again on JVM for OpenNLP
    >     >>>>
    >     >>>>
    >     >>>> Please share your feedback here or on the github page
    >     >>>> https://github.com/apache/opennlp-sandbox/pull/3 .
    >     >>>>
    >     >>>>
    >     >>> I think the approach outlined here sounds good, I think we could
    >     >>> incorporate the PR as soon as it implements the Doccat API.
    >     >>> Then we may see whether and how it makes sense to adjust it to use
    > other
    >     >>> types of embeddings (e.g. paragraph vectors) and / or different
    > network
    >     >>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
    >     >>>
    >     >>> Looking forward to see this move forward,
    >     >>> Regards,
    >     >>> Tommaso
    >     >>>
    >     >>>
    >     >>>>
    >     >>>> Thanks,
    >     >>>> TG
    >     >>>>
    >     >>>>
    >     >>>> --
    >     >>>> *Thamme Gowda *
    >     >>>> @thammegowda <https://twitter.com/thammegowda> |
    >     >>>> http://scf.usc.edu/~tnarayan/
    >     >>>> ~Sent via somebody's Webmail server
    >     >>>>
    >
    >
    >
    >
    



Re: Document Categorizer based on GloVe + LSTM (powered by DL4J)

Posted by Thamme Gowda <th...@apache.org>.
@Tommaso @Jörn
Thanks. I will update the PR by making it implement the Doccat API.
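
The wrapper would implement opennlp.tools.doccat.DocumentCategorizer; the
core inference path it delegates to could look roughly like this (a sketch
only, not the PR's code; class and method names are made up):

    import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;
    import org.nd4j.linalg.indexing.INDArrayIndex;
    import org.nd4j.linalg.indexing.NDArrayIndex;

    /** Scores a tokenized document with the GloVe + LSTM network. */
    public class GloveLstmScorer {

        private final MultiLayerNetwork net;
        private final WordVectors glove;
        private final int vectorSize;

        public GloveLstmScorer(MultiLayerNetwork net, WordVectors glove, int vectorSize) {
            this.net = net;
            this.glove = glove;
            this.vectorSize = vectorSize;
        }

        /** Returns one score per category for the given tokens. */
        public double[] categorize(String[] tokens) {
            // Features are [miniBatch=1, vectorSize, seqLength]: one GloVe vector per token.
            INDArray features = Nd4j.zeros(new int[] {1, vectorSize, tokens.length});
            for (int t = 0; t < tokens.length; t++) {
                if (glove.hasWord(tokens[t])) {
                    INDArray vector = glove.getWordVectorMatrix(tokens[t]);
                    features.put(new INDArrayIndex[] {NDArrayIndex.point(0),
                        NDArrayIndex.all(), NDArrayIndex.point(t)}, vector);
                }
            }
            // RNN output is [1, nCategories, seqLength]; read the last time step.
            INDArray output = net.output(features);
            int last = tokens.length - 1;
            double[] scores = new double[(int) output.size(1)];
            for (int c = 0; c < scores.length; c++) {
                scores[c] = output.getDouble(0, c, last);
            }
            return scores;
        }
    }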

@Rodrigo
I have not yet tested on the full Stanford Large Movie Review dataset; it
takes quite a while to train, perhaps a few days for multiple passes over the
entire dataset (on my i5 CPU, no GPUs at the moment).
I have trained models (multiple times) on 3,000 examples (1,500 positive,
1,500 negative) for two epochs; the F1 was approximately 0.70.
I plan to train on the complete dataset down the line and tune the
network with more layers (that is the fun part). This PR essentially sets up
the infrastructure for that.

@Chris
Hi Prof. Thanks for the kind words! Just getting started with my new job
here - more NLP and Machine Translation stuff to come.

-Thamme

On Wed, Jul 5, 2017 at 8:26 AM, Chris Mattmann <ma...@apache.org> wrote:

> Thamme, great job!
>
> (proud academic dad)
>
> Cheers,
> Chris
>
>
>
>
> On 7/5/17, 12:31 AM, "Joern Kottmann" <ko...@gmail.com> wrote:
>
>     +1 to merge this when it implements the Document Categorizer, then we
>     can also use those tools to train and evaluate it
>
>     Jörn
>
>     On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri <ra...@apache.org>
> wrote:
>     > Hello again,
>     >
>     > @Thamme, out of curiosity, do you have evaluation numbers on the
>     > Stanford Large Movie Review dataset?
>     >
>     > Best,
>     >
>     > Rodrigo
>     >
>     > On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <ra...@apache.org>
> wrote:
>     >> +1 to Tommaso's comment. This would be very nice to have in the
> project.
>     >>
>     >> R
>     >>
>     >> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
>     >> <to...@gmail.com> wrote:
>     >>> thanks Thamme for bringing this to the list!
>     >>>
>     >>>
>     >>> Il giorno mer 5 lug 2017 alle ore 03:49 Thamme Gowda <
> tgowdan@gmail.com> ha
>     >>> scritto:
>     >>>
>     >>>> Hello OpenNLP Devs,
>     >>>>
>     >>>> I am working with text classification using word embeddings like
>     >>>> Gloves/Word2Vec and LSTM networks.
>     >>>> It will be interesting to see if we can use it as document
> categorizer,
>     >>>> especially for sentiment analysis in OpenNLP.
>     >>>>
>     >>>> I have already raised a PR to the sandbox repo -
>     >>>> https://github.com/apache/opennlp-sandbox/pull/3
>     >>>>
>     >>>> This is first version, and I expect to receive feedback from Dev
> community
>     >>>> to make it work for everyone.
>     >>>>
>     >>>> Here are the design choices I have made for the initial version:
>     >>>>
>     >>>>    - Using pre-trained Gloves - I felt the glove vector format is
> clean,
>     >>>>    easily customizable in terms of dimensions and vocabulary
> size, and
>     >>>> (also I
>     >>>>    have been reading a lot about them from Stanford NLP group).
>     >>>>       - Training Gloves isnt hard either, we can do it using the
> original C
>     >>>>       library as well as by using DL4J.
>     >>>>       - Using DL4J's Multi layer networks with LSTM instead of
> reinventing
>     >>>>    this stuff again on JVM for OpenNLP
>     >>>>
>     >>>>
>     >>>> Please share your feedback here or on the github page
>     >>>> https://github.com/apache/opennlp-sandbox/pull/3 .
>     >>>>
>     >>>>
>     >>> I think the approach outlined here sounds good, I think we could
>     >>> incorporate the PR as soon as it implements the Doccat API.
>     >>> Then we may see whether and how it makes sense to adjust it to use
> other
>     >>> types of embeddings (e.g. paragraph vectors) and / or different
> network
>     >>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
>     >>>
>     >>> Looking forward to see this move forward,
>     >>> Regards,
>     >>> Tommaso
>     >>>
>     >>>
>     >>>>
>     >>>> Thanks,
>     >>>> TG
>     >>>>
>     >>>>
>     >>>> --
>     >>>> *Thamme Gowda *
>     >>>> @thammegowda <https://twitter.com/thammegowda> |
>     >>>> http://scf.usc.edu/~tnarayan/
>     >>>> ~Sent via somebody's Webmail server
>     >>>>
>
>
>
>

Re: Document Categorizer based on GloVe + LSTM (powered by DL4J)

Posted by Chris Mattmann <ma...@apache.org>.
Thamme, great job! 

(proud academic dad)

Cheers,
Chris




On 7/5/17, 12:31 AM, "Joern Kottmann" <ko...@gmail.com> wrote:

    +1 to merge this when it implements the Document Categorizer, then we
    can also use those tools to train and evaluate it
    
    Jörn
    
    On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri <ra...@apache.org> wrote:
    > Hello again,
    >
    > @Thamme, out of curiosity, do you have evaluation numbers on the
    > Stanford Large Movie Review dataset?
    >
    > Best,
    >
    > Rodrigo
    >
    > On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <ra...@apache.org> wrote:
    >> +1 to Tommaso's comment. This would be very nice to have in the project.
    >>
    >> R
    >>
    >> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
    >> <to...@gmail.com> wrote:
    >>> thanks Thamme for bringing this to the list!
    >>>
    >>>
    >>> Il giorno mer 5 lug 2017 alle ore 03:49 Thamme Gowda <tg...@gmail.com> ha
    >>> scritto:
    >>>
    >>>> Hello OpenNLP Devs,
    >>>>
    >>>> I am working with text classification using word embeddings like
    >>>> Gloves/Word2Vec and LSTM networks.
    >>>> It will be interesting to see if we can use it as document categorizer,
    >>>> especially for sentiment analysis in OpenNLP.
    >>>>
    >>>> I have already raised a PR to the sandbox repo -
    >>>> https://github.com/apache/opennlp-sandbox/pull/3
    >>>>
    >>>> This is first version, and I expect to receive feedback from Dev community
    >>>> to make it work for everyone.
    >>>>
    >>>> Here are the design choices I have made for the initial version:
    >>>>
    >>>>    - Using pre-trained Gloves - I felt the glove vector format is clean,
    >>>>    easily customizable in terms of dimensions and vocabulary size, and
    >>>> (also I
    >>>>    have been reading a lot about them from Stanford NLP group).
    >>>>       - Training Gloves isnt hard either, we can do it using the original C
    >>>>       library as well as by using DL4J.
    >>>>       - Using DL4J's Multi layer networks with LSTM instead of reinventing
    >>>>    this stuff again on JVM for OpenNLP
    >>>>
    >>>>
    >>>> Please share your feedback here or on the github page
    >>>> https://github.com/apache/opennlp-sandbox/pull/3 .
    >>>>
    >>>>
    >>> I think the approach outlined here sounds good, I think we could
    >>> incorporate the PR as soon as it implements the Doccat API.
    >>> Then we may see whether and how it makes sense to adjust it to use other
    >>> types of embeddings (e.g. paragraph vectors) and / or different network
    >>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
    >>>
    >>> Looking forward to see this move forward,
    >>> Regards,
    >>> Tommaso
    >>>
    >>>
    >>>>
    >>>> Thanks,
    >>>> TG
    >>>>
    >>>>
    >>>> --
    >>>> *Thamme Gowda *
    >>>> @thammegowda <https://twitter.com/thammegowda> |
    >>>> http://scf.usc.edu/~tnarayan/
    >>>> ~Sent via somebody's Webmail server
    >>>>
    



Re: Document Categorizer based on GloVe + LSTM (powered by DL4J)

Posted by Joern Kottmann <ko...@gmail.com>.
+1 to merge this once it implements the Document Categorizer; then we
can also use the existing doccat tools to train and evaluate it.
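
Once it is behind the Doccat API, the stock evaluator should work against
the test split, roughly like this (a sketch; the reader class and model
file name refer to the earlier sketches and are assumptions):

    import java.io.File;

    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerEvaluator;
    import opennlp.tools.doccat.DocumentCategorizerME;
    import opennlp.tools.doccat.DocumentSample;
    import opennlp.tools.util.ObjectStream;

    public class EvaluateDoccatSketch {
        public static void main(String[] args) throws Exception {
            DoccatModel model = new DoccatModel(new File("en-sentiment-doccat.bin"));
            DocumentCategorizerEvaluator evaluator =
                new DocumentCategorizerEvaluator(new DocumentCategorizerME(model));
            // Hypothetical reader over the aclImdb test split.
            ObjectStream<DocumentSample> testSamples =
                new LargeMovieReviewSampleStream(new File("aclImdb/test"));
            evaluator.evaluate(testSamples);
            testSamples.close();
            System.out.println("accuracy: " + evaluator.getAccuracy());
        }
    }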

Jörn

On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri <ra...@apache.org> wrote:
> Hello again,
>
> @Thamme, out of curiosity, do you have evaluation numbers on the
> Stanford Large Movie Review dataset?
>
> Best,
>
> Rodrigo
>
> On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <ra...@apache.org> wrote:
>> +1 to Tommaso's comment. This would be very nice to have in the project.
>>
>> R
>>
>> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
>> <to...@gmail.com> wrote:
>>> thanks Thamme for bringing this to the list!
>>>
>>>
>>> Il giorno mer 5 lug 2017 alle ore 03:49 Thamme Gowda <tg...@gmail.com> ha
>>> scritto:
>>>
>>>> Hello OpenNLP Devs,
>>>>
>>>> I am working with text classification using word embeddings like
>>>> Gloves/Word2Vec and LSTM networks.
>>>> It will be interesting to see if we can use it as document categorizer,
>>>> especially for sentiment analysis in OpenNLP.
>>>>
>>>> I have already raised a PR to the sandbox repo -
>>>> https://github.com/apache/opennlp-sandbox/pull/3
>>>>
>>>> This is first version, and I expect to receive feedback from Dev community
>>>> to make it work for everyone.
>>>>
>>>> Here are the design choices I have made for the initial version:
>>>>
>>>>    - Using pre-trained Gloves - I felt the glove vector format is clean,
>>>>    easily customizable in terms of dimensions and vocabulary size, and
>>>> (also I
>>>>    have been reading a lot about them from Stanford NLP group).
>>>>       - Training Gloves isnt hard either, we can do it using the original C
>>>>       library as well as by using DL4J.
>>>>       - Using DL4J's Multi layer networks with LSTM instead of reinventing
>>>>    this stuff again on JVM for OpenNLP
>>>>
>>>>
>>>> Please share your feedback here or on the github page
>>>> https://github.com/apache/opennlp-sandbox/pull/3 .
>>>>
>>>>
>>> I think the approach outlined here sounds good, I think we could
>>> incorporate the PR as soon as it implements the Doccat API.
>>> Then we may see whether and how it makes sense to adjust it to use other
>>> types of embeddings (e.g. paragraph vectors) and / or different network
>>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
>>>
>>> Looking forward to see this move forward,
>>> Regards,
>>> Tommaso
>>>
>>>
>>>>
>>>> Thanks,
>>>> TG
>>>>
>>>>
>>>> --
>>>> *Thamme Gowda *
>>>> @thammegowda <https://twitter.com/thammegowda> |
>>>> http://scf.usc.edu/~tnarayan/
>>>> ~Sent via somebody's Webmail server
>>>>

Re: Document Categorizer based on GloVe + LSTM (powered by DL4J)

Posted by Rodrigo Agerri <ra...@apache.org>.
Hello again,

@Thamme, out of curiosity, do you have evaluation numbers on the
Stanford Large Movie Review dataset?

Best,

Rodrigo

On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <ra...@apache.org> wrote:
> +1 to Tommaso's comment. This would be very nice to have in the project.
>
> R
>
> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
> <to...@gmail.com> wrote:
>> thanks Thamme for bringing this to the list!
>>
>>
>> Il giorno mer 5 lug 2017 alle ore 03:49 Thamme Gowda <tg...@gmail.com> ha
>> scritto:
>>
>>> Hello OpenNLP Devs,
>>>
>>> I am working with text classification using word embeddings like
>>> Gloves/Word2Vec and LSTM networks.
>>> It will be interesting to see if we can use it as document categorizer,
>>> especially for sentiment analysis in OpenNLP.
>>>
>>> I have already raised a PR to the sandbox repo -
>>> https://github.com/apache/opennlp-sandbox/pull/3
>>>
>>> This is first version, and I expect to receive feedback from Dev community
>>> to make it work for everyone.
>>>
>>> Here are the design choices I have made for the initial version:
>>>
>>>    - Using pre-trained Gloves - I felt the glove vector format is clean,
>>>    easily customizable in terms of dimensions and vocabulary size, and
>>> (also I
>>>    have been reading a lot about them from Stanford NLP group).
>>>       - Training Gloves isnt hard either, we can do it using the original C
>>>       library as well as by using DL4J.
>>>       - Using DL4J's Multi layer networks with LSTM instead of reinventing
>>>    this stuff again on JVM for OpenNLP
>>>
>>>
>>> Please share your feedback here or on the github page
>>> https://github.com/apache/opennlp-sandbox/pull/3 .
>>>
>>>
>> I think the approach outlined here sounds good, I think we could
>> incorporate the PR as soon as it implements the Doccat API.
>> Then we may see whether and how it makes sense to adjust it to use other
>> types of embeddings (e.g. paragraph vectors) and / or different network
>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
>>
>> Looking forward to see this move forward,
>> Regards,
>> Tommaso
>>
>>
>>>
>>> Thanks,
>>> TG
>>>
>>>
>>> --
>>> *Thamme Gowda *
>>> @thammegowda <https://twitter.com/thammegowda> |
>>> http://scf.usc.edu/~tnarayan/
>>> ~Sent via somebody's Webmail server
>>>

Re: Document Categorizer based on GloVe + LSTM (powered by DL4J)

Posted by Rodrigo Agerri <ra...@apache.org>.
+1 to Tommaso's comment. This would be very nice to have in the project.

R

On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
<to...@gmail.com> wrote:
> thanks Thamme for bringing this to the list!
>
>
> Il giorno mer 5 lug 2017 alle ore 03:49 Thamme Gowda <tg...@gmail.com> ha
> scritto:
>
>> Hello OpenNLP Devs,
>>
>> I am working with text classification using word embeddings like
>> Gloves/Word2Vec and LSTM networks.
>> It will be interesting to see if we can use it as document categorizer,
>> especially for sentiment analysis in OpenNLP.
>>
>> I have already raised a PR to the sandbox repo -
>> https://github.com/apache/opennlp-sandbox/pull/3
>>
>> This is first version, and I expect to receive feedback from Dev community
>> to make it work for everyone.
>>
>> Here are the design choices I have made for the initial version:
>>
>>    - Using pre-trained Gloves - I felt the glove vector format is clean,
>>    easily customizable in terms of dimensions and vocabulary size, and
>> (also I
>>    have been reading a lot about them from Stanford NLP group).
>>       - Training Gloves isnt hard either, we can do it using the original C
>>       library as well as by using DL4J.
>>       - Using DL4J's Multi layer networks with LSTM instead of reinventing
>>    this stuff again on JVM for OpenNLP
>>
>>
>> Please share your feedback here or on the github page
>> https://github.com/apache/opennlp-sandbox/pull/3 .
>>
>>
> I think the approach outlined here sounds good, I think we could
> incorporate the PR as soon as it implements the Doccat API.
> Then we may see whether and how it makes sense to adjust it to use other
> types of embeddings (e.g. paragraph vectors) and / or different network
> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
>
> Looking forward to see this move forward,
> Regards,
> Tommaso
>
>
>>
>> Thanks,
>> TG
>>
>>
>> --
>> *Thamme Gowda *
>> @thammegowda <https://twitter.com/thammegowda> |
>> http://scf.usc.edu/~tnarayan/
>> ~Sent via somebody's Webmail server
>>

Re: Document Categorizer based on GloVe + LSTM (powered by DL4J)

Posted by Tommaso Teofili <to...@gmail.com>.
Thanks, Thamme, for bringing this to the list!


On Wed, Jul 5, 2017 at 03:49, Thamme Gowda <tg...@gmail.com> wrote:

> Hello OpenNLP Devs,
>
> I am working with text classification using word embeddings like
> Gloves/Word2Vec and LSTM networks.
> It will be interesting to see if we can use it as document categorizer,
> especially for sentiment analysis in OpenNLP.
>
> I have already raised a PR to the sandbox repo -
> https://github.com/apache/opennlp-sandbox/pull/3
>
> This is first version, and I expect to receive feedback from Dev community
> to make it work for everyone.
>
> Here are the design choices I have made for the initial version:
>
>    - Using pre-trained Gloves - I felt the glove vector format is clean,
>    easily customizable in terms of dimensions and vocabulary size, and
> (also I
>    have been reading a lot about them from Stanford NLP group).
>       - Training Gloves isnt hard either, we can do it using the original C
>       library as well as by using DL4J.
>       - Using DL4J's Multi layer networks with LSTM instead of reinventing
>    this stuff again on JVM for OpenNLP
>
>
> Please share your feedback here or on the github page
> https://github.com/apache/opennlp-sandbox/pull/3 .
>
>
I think the approach outlined here sounds good, and we could
incorporate the PR as soon as it implements the Doccat API.
Then we may see whether and how it makes sense to adjust it to use other
types of embeddings (e.g. paragraph vectors) and/or different network
setups (e.g. more hidden layers, a bidirectional LSTM, etc.).
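
As one concrete variation of that kind, the layer stack could for instance
be swapped for a bidirectional LSTM plus a second recurrent layer (a sketch
only; layer sizes are arbitrary and nothing here is settled):

    import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.layers.GravesBidirectionalLSTM;
    import org.deeplearning4j.nn.conf.layers.GravesLSTM;
    import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.lossfunctions.LossFunctions;

    public class DeeperNetworkSketch {

        /** Bidirectional LSTM, a second LSTM on top, then softmax over categories. */
        public static MultiLayerConfiguration deeperConf(int vectorSize, int nCategories) {
            return new NeuralNetConfiguration.Builder()
                .seed(42)
                .list()
                .layer(0, new GravesBidirectionalLSTM.Builder()
                    .nIn(vectorSize).nOut(256).activation(Activation.TANH).build())
                .layer(1, new GravesLSTM.Builder()
                    .nIn(256).nOut(128).activation(Activation.TANH).build())
                .layer(2, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                    .activation(Activation.SOFTMAX).nIn(128).nOut(nCategories).build())
                .build();
        }
    }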

Looking forward to seeing this move forward,
Regards,
Tommaso


>
> Thanks,
> TG
>
>
> --
> *Thamme Gowda *
> @thammegowda <https://twitter.com/thammegowda> |
> http://scf.usc.edu/~tnarayan/
> ~Sent via somebody's Webmail server
>