Posted to users@opennlp.apache.org by Nick Burch <ni...@apache.org> on 2019/02/06 19:03:10 UTC

Building a sentiment analyser - comments and questions!

Hi All

Last week, I took part in a hackathon for Alfresco, the open source 
content management system, and as part of that we had a play with 
integrating Sentiment Analysis [1]. As Stanford CoreNLP has sentiment 
analysis built in, we first used that. Then I tried to use Apache OpenNLP 
instead. This wasn't that easy, but it ended up working better for our 
test documents.

I figured it might be good to share my experiences, in case there are 
things I could improve, or in case there's documentation / examples / etc. 
that could be improved!


So, first up, the approach. I couldn't find anything in the docs on 
sentiment analysis, so I decided to try using the Document Categorizer, 
feeding it two categories to learn/predict on: positive and negative. Is 
that the best route?

(I did find a 2016 GSoC project to add sentiment analysis, but decided to 
stick with just core OpenNLP code)
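For reference, this is roughly the shape of what I ended up with, against 
the 1.9.x doccat API (class and method names are from the opennlp-tools 
javadoc; the file name and training text are made up, and the training 
file is assumed to be one document per line, category first):

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentimentTrainer {
    public static void main(String[] args) throws IOException {
        // Training file format expected by DocumentSampleStream:
        // "<category> <document text>" on each line, e.g.
        // "positive really happy with this purchase"
        InputStreamFactory in =
            new MarkableFileInputStreamFactory(new File("reviews.train"));
        ObjectStream<String> lines =
            new PlainTextByLineStream(in, StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

        // Train a two-category (positive/negative) maxent model
        DoccatModel model = DocumentCategorizerME.train(
            "en", samples, TrainingParameters.defaultParams(),
            new DoccatFactory());

        // Predict: categorize() takes the document as a token array
        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
        double[] probs = categorizer.categorize(
            new String[] {"really", "happy", "with", "this"});
        System.out.println(categorizer.getBestCategory(probs));
    }
}
```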


Next, I hit a snag - the code at 
https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.doccat 
doesn't compile against 1.9.1. I've raised 
https://issues.apache.org/jira/browse/OPENNLP-1237 for this.


Having guessed at the new API syntax, I then needed to feed in some 
training data. Based on [2], I opted for the JHU Amazon review data [3]. 
I'm not sure if there are better free datasets for English-language 
sentiment?


Next snag - the data format. The JHU data isn't in the format that the 
training tool or PlainTextByLineStream expects. What's more, I couldn't 
find any examples in the manual of an alternative DocumentSampleStream 
input, or of a custom ObjectStream<DocumentSample>. Is there one? Is there 
anything else on writing your own? Should there be?

(I ended up writing one [4] in Groovy, which I'm fairly sure is non-ideal, 
and probably could be much improved, suggestions welcome!)
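For anyone hitting the same gap: as far as I can tell, the pattern is just 
to implement the three methods of ObjectStream<DocumentSample> yourself. A 
minimal sketch, assuming the input has already been normalised to 
tab-separated "category<TAB>text" lines (the class name and line format 
are my own invention, not anything from OpenNLP):

```java
import java.io.BufferedReader;
import java.io.IOException;

import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;

/** Reads "category<TAB>document text" lines as DocumentSamples. */
public class LabelledLineSampleStream implements ObjectStream<DocumentSample> {
    private final BufferedReader reader;

    public LabelledLineSampleStream(BufferedReader reader) {
        this.reader = reader;
    }

    @Override
    public DocumentSample read() throws IOException {
        String line = reader.readLine();
        if (line == null) {
            return null; // null signals end of stream to the trainer
        }
        int tab = line.indexOf('\t');
        String category = line.substring(0, tab);
        String[] tokens = WhitespaceTokenizer.INSTANCE
            .tokenize(line.substring(tab + 1));
        return new DocumentSample(category, tokens);
    }

    @Override
    public void reset() throws IOException {
        // Only works if the underlying reader supports mark/reset
        reader.reset();
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```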


Next challenge - TrainingParameters. Several blog posts I found on using 
the DoccatFactory suggested a cutoff of 2 and iterations of 30. I couldn't 
spot anything in the manual under Document Categorizer for parameters, 
though other sections did have them. Did I miss it? Should there be 
something in the manual?
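In case it helps anyone else, this is how I ended up passing those two 
values - the parameter name constants are from the 1.9.x 
TrainingParameters javadoc, and the numbers are just the ones the blog 
posts suggested, not anything I've tuned:

```java
import opennlp.tools.util.TrainingParameters;

TrainingParameters params = TrainingParameters.defaultParams();
params.put(TrainingParameters.ITERATIONS_PARAM, "30");
params.put(TrainingParameters.CUTOFF_PARAM, "2");
// then pass params to:
// DocumentCategorizerME.train("en", samples, params, new DoccatFactory());
```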


Building the model was nice and quick, and getting predictions was easy 
too, which was good! However, with my (quite possibly wrong) plan of 
training on two categories, Positive and Negative, I couldn't see how to 
get a good "how much sentiment" figure out. I opted for just returning 
whichever category was reported as best, with no score (since the two 
categories typically came back with very similar scores, one generally 
slightly higher than the other). Is there a better way?
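One naive idea I considered (my own, not anything from the OpenNLP API): 
since the two outcome probabilities sum to one, their difference gives a 
single signed strength in [-1, 1], where values near zero mean the model 
is on the fence:

```java
public class SentimentScore {
    /**
     * Collapses the two doccat probabilities into one signed score:
     * +1.0 strongly positive, -1.0 strongly negative, ~0.0 uncertain.
     */
    public static double signedScore(double pPositive, double pNegative) {
        return pPositive - pNegative;
    }

    public static void main(String[] args) {
        // e.g. probabilities of 0.52 vs 0.48 give a weakly positive score
        System.out.println(signedScore(0.52, 0.48));
    }
}
```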


Finally, it did all work, and in our testing it did better than Stanford 
CoreNLP, so thanks everyone for the library :)

Thanks
Nick

[1] https://github.com/Alfresco/SentimentAnalysis
[2] https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124
[3] http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
[4] https://github.com/Alfresco/SentimentAnalysis/blob/master/sentiment-analysis/src/main/groovy/JHUSentimentReader.groovy

Re: Building a sentiment analyser - comments and questions!

Posted by Joern Kottmann <ko...@gmail.com>.
Hello Nick,

thanks for your feedback.

It would be very nice if you could help us improve things. The doccat
component is used by many of the users I know, and I am sure they would
benefit from your help.

Yes, you are supposed to implement your own
ObjectStream<DocumentSample> in case the default is not good for some
reason.

And we should extend the manual about how to pass the
TrainingParameters (we should check that for every component).

I'm happy you found it useful anyway. Let's see if your points can be
addressed with manual updates; code changes to make training easier
are also always useful.

Jörn
