Posted to dev@stanbol.apache.org by Bertrand Delacretaz <bd...@apache.org> on 2011/02/18 15:19:00 UTC

Sentiment analysis engine?

Hi,

I'm interested in looking at this [1] in my Copious Free Time. Would
any of our existing enhancement engines allow me to experiment with
sentiment analysis?

Or does anyone have pointers to good (ideally java-based) tools that do this?

-Bertrand

[1] http://en.wikipedia.org/wiki/Sentiment_analysis

Re: Sentiment analysis engine?

Posted by Tommaso Teofili <to...@gmail.com>.
Hi Bertrand,
I'd also be interested in working on such an engine.
What I know is mostly related to UIMA, so this is just one point of view.
In particular, I found a high-level example of this being done with UIMA
and a thesaurus [1]; I think that's also relevant to the work being done
by Florent.
A paper from a workshop a while ago [2] mentions that UIMA has been used
for this (see "Towards a Unified Framework for Sentiment Analysis and
Search with UIMA").
See also LingPipe [3] and NLTK (though it's Python) [4][5].
Cheers,
Tommaso


[1] :
http://seasr.org/documentation/uima-and-seasr/sentiment-tracking-from-uima-data/
[2] : http://www.ids-mannheim.de/d-spin/workshop.pdf#page=43
[3] : http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html
[4] : http://www.nltk.org/
[5] :
http://dmcer.net/Brief%20Introduction%20to%20Natural%20Language%20Processing.pdf

2011/2/18 Bertrand Delacretaz <bd...@apache.org>

> Hi,
>
> I'm interested in looking at this [1] in my Copious Free Time, would
> any of our existing enhancement engines allow me to experiment with
> sentiment analysis?
>
> Or does anyone have pointers to good (ideally java-based) tools that do
> this?
>
> -Bertrand
>
> [1] http://en.wikipedia.org/wiki/Sentiment_analysis
>

Re: Sentiment analysis engine?

Posted by Tommaso Teofili <to...@gmail.com>.
2011/2/26 Olivier Grisel <ol...@ensta.org>

> 2011/2/26 Tommaso Teofili <to...@gmail.com>:
> > Hi again,
> > for the sentiment analysis engine I think that we can start from a small
> POC
> > which uses alchemyapi.com service (just released) for sentiment analysis
> > [1].
> > In UIMA there is already an AlchemyAPIAnnotator [2][3] we can use for the
> > purpose inside a Stanbol engine.
> > Maybe once we have such an implementation we can improve it to avoid
> relying
> > on external service in the following way:
> >
> >   - get a corpus of data (not annotated free text) to train our engine
> for
> >   sentiment analysis.
> >   - massively pass them to the engine using alchemyapi.com creating the
> >   output as text annotated with the extracted sentiment.
> >   - pass them to Mahout [4] (i.e. clustering) to create a (statistical)
> >   model.
>
> Sentiment analysis is not a clustering task, it's a supervised
> document classification / regression task (comparable to language
> detection for instance).
>

thanks for the clarification :)


>
> You can use Mahout or OpenNLP to build such a classifier as explained
> in my previous mail.
>

agreed


>
> >   - refactor the engine to load the generated module and detach from
> >   alchemyapi.com.
> >   - each document sent to the engine gets a sentiment "against" the model
> >   created.
> >
> > What do you think?
>
> Ok for a new wrapper for AlchemyAPI. We will need to extend the
> vocabulary for document classification tasks such as topic assignment
> and sentiment analysis, see:
>
>  https://issues.apache.org/jira/browse/STANBOL-28
>  https://issues.apache.org/jira/browse/STANBOL-29


good point :)


>
>
> It's an interesting approach to reuse the output of other trained
> models. However it might even be easier to use the datasets I
> mentioned earlier.
>

I agree, so we can actually use both: start from the movie review datasets
that are already available and eventually switch to more general-purpose
trained models.

Cheers,
Tommaso


>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>

Re: Sentiment analysis engine?

Posted by Olivier Grisel <ol...@ensta.org>.
2011/2/26 Tommaso Teofili <to...@gmail.com>:
> Hi again,
> for the sentiment analysis engine I think that we can start from a small POC
> which uses alchemyapi.com service (just released) for sentiment analysis
> [1].
> In UIMA there is already an AlchemyAPIAnnotator [2][3] we can use for the
> purpose inside a Stanbol engine.
> Maybe once we have such an implementation we can improve it to avoid relying
> on external service in the following way:
>
>   - get a corpus of data (not annotated free text) to train our engine for
>   sentiment analysis.
>   - massively pass them to the engine using alchemyapi.com creating the
>   output as text annotated with the extracted sentiment.
>   - pass them to Mahout [4] (i.e. clustering) to create a (statistical)
>   model.

Sentiment analysis is not a clustering task; it's a supervised
document classification / regression task (comparable to language
detection, for instance).

You can use Mahout or OpenNLP to build such a classifier as explained
in my previous mail.

>   - refactor the engine to load the generated module and detach from
>   alchemyapi.com.
>   - each document sent to the engine gets a sentiment "against" the model
>   created.
>
> What do you think?

Ok for a new wrapper for AlchemyAPI. We will need to extend the
vocabulary for document classification tasks such as topic assignment
and sentiment analysis, see:

  https://issues.apache.org/jira/browse/STANBOL-28
  https://issues.apache.org/jira/browse/STANBOL-29

It's an interesting approach to reuse the output of other trained
models. However, it might even be easier to use the datasets I
mentioned earlier.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: Sentiment analysis engine?

Posted by Tommaso Teofili <to...@gmail.com>.
Hi again,
for the sentiment analysis engine I think we can start from a small POC
which uses the alchemyapi.com service (just released) for sentiment
analysis [1].
In UIMA there is already an AlchemyAPIAnnotator [2][3] which we can use
for this purpose inside a Stanbol engine.
Maybe once we have such an implementation we can improve it, to avoid
relying on an external service, in the following way:

   - get a corpus of data (unannotated free text) to train our engine for
   sentiment analysis.
   - pass it in bulk to the engine backed by alchemyapi.com, producing the
   output as text annotated with the extracted sentiment (a rough sketch of
   this step follows after the references below).
   - pass the annotated texts to Mahout [4] (i.e. clustering) to create a
   (statistical) model.
   - refactor the engine to load the generated model and detach from
   alchemyapi.com.
   - each document sent to the engine gets a sentiment score "against" the
   created model.

What do you think?
Cheers,
Tommaso

[1] : http://www.alchemyapi.com/api/sentiment/textc.html
[2] : http://uima.apache.org/sandbox.html#alchemy.annotator
[3] : https://issues.apache.org/jira/browse/UIMA-2073
[4] : http://mahout.apache.org/
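
As a rough illustration of the bulk-labeling step above, here is a minimal
sketch in Java. The SentimentService interface stands in for a wrapper
around the AlchemyAPI sentiment call documented in [1]; the interface, the
directory handling and the "label<TAB>text" output format are assumptions
made for illustration, not an existing API.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.PrintWriter;

    public class SentimentCorpusBootstrap {

        // Hypothetical wrapper around the AlchemyAPI sentiment call from [1];
        // assumed to return a label such as "positive", "negative" or "neutral".
        public interface SentimentService {
            String getSentiment(String text) throws Exception;
        }

        // Labels every raw document and writes one "label<TAB>text" training
        // sample per line, ready to be fed to a classifier later on.
        public static void bootstrap(SentimentService service, File rawCorpusDir,
                File out) throws Exception {
            PrintWriter writer = new PrintWriter(new FileWriter(out));
            try {
                for (File doc : rawCorpusDir.listFiles()) {
                    String text = read(doc).replaceAll("\\s+", " ").trim();
                    String label = service.getSentiment(text); // remote call
                    writer.println(label + "\t" + text);
                }
            } finally {
                writer.close();
            }
        }

        private static String read(File f) throws Exception {
            StringBuilder sb = new StringBuilder();
            BufferedReader reader = new BufferedReader(new FileReader(f));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    sb.append(line).append(' ');
                }
            } finally {
                reader.close();
            }
            return sb.toString();
        }
    }

The resulting file can then be used as labeled training input for OpenNLP or
Mahout, as discussed in the rest of the thread.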

2011/2/19 Olivier Grisel <ol...@ensta.org>

> 2011/2/18 Bertrand Delacretaz <bd...@apache.org>:
> > Hi,
> >
> > I'm interested in looking at this [1] in my Copious Free Time, would
> > any of our existing enhancement engines allow me to experiment with
> > sentiment analysis?
> >
> > Or does anyone have pointers to good (ideally java-based) tools that do
> this?
>
> There is a text classification utility in opennlp called
> DocumentCategorizerME (BTW I am almost done with upgrading the
> dependency to version 1.5).
>
> Mahout also has classifiers and some of them such as the SGD Logistic
> Regression classifier do not require to setup a hadoop cluster to use:
>
>    org.apache.mahout.classifier.sgd.OnlineLogisticRegression
>
> To use if for document classification you will first need to extract
> feature vectors out of the text. This is explained here:
>
>    https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>
> In both cases to turn a document classifier into a sentiment analysis
> tool you need a training corpus with sample documents labeled either
> as positive or negative (and sometimes neutral). AFAIK, nltk uses a
> movie review corpus to train polarity models for sentiment analysis. I
> think it originates from this work (where you can download the raw
> data without installing nltk):
>
>    http://www.cs.cornell.edu/people/pabo/movie-review-data/
>
> This is probably a good start even though it's limited to the cinema
> domain and the English language. There is a much larger multilingual
> corpus for various products types available here (however the format
> is pre-processed probably for copyright issues, hence you will need to
> write a dedicated vectorizer):
>
>  http://www.webis.de/research/corpora/webis-cls-10
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>

Re: Sentiment analysis engine?

Posted by Olivier Grisel <ol...@ensta.org>.
2011/2/18 Bertrand Delacretaz <bd...@apache.org>:
> Hi,
>
> I'm interested in looking at this [1] in my Copious Free Time, would
> any of our existing enhancement engines allow me to experiment with
> sentiment analysis?
>
> Or does anyone have pointers to good (ideally java-based) tools that do this?

There is a text classification utility in OpenNLP called
DocumentCategorizerME (by the way, I am almost done upgrading the
dependency to version 1.5).
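
For reference, a minimal sketch of what the doccat API could look like in
use, assuming OpenNLP 1.5; the model file name ("en-sentiment.bin") and the
"positive"/"negative" categories are placeholders for whatever a trained
model actually provides:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerME;

    public class SentimentCategorizerDemo {

        public static void main(String[] args) throws Exception {
            // Load a previously trained doccat model (file name is a placeholder).
            InputStream in = new FileInputStream("en-sentiment.bin");
            DoccatModel model = new DoccatModel(in);
            in.close();

            DocumentCategorizerME categorizer = new DocumentCategorizerME(model);

            // The categorizer scores tokens; a real engine would reuse an
            // OpenNLP tokenizer here instead of a whitespace split.
            String[] tokens =
                    "a surprisingly touching and well acted movie".split("\\s+");
            double[] outcomes = categorizer.categorize(tokens);

            // Prints the best scoring category, e.g. "positive" or "negative",
            // depending on how the training samples were labeled.
            System.out.println(categorizer.getBestCategory(outcomes));
        }
    }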

Mahout also has classifiers, and some of them, such as the SGD logistic
regression classifier, do not require setting up a Hadoop cluster:

    org.apache.mahout.classifier.sgd.OnlineLogisticRegression

To use it for document classification you will first need to extract
feature vectors from the text. This is explained here:

    https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
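
For the non-Hadoop path, a minimal sketch of training
OnlineLogisticRegression on hashed bag-of-words features. It assumes the
SGD classes and feature encoders described in the Mahout documentation
(package names may differ between Mahout versions); the feature-space
size, prior and toy samples are arbitrary:

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class SgdSentimentSketch {

        private static final int FEATURES = 10000; // hashed feature space size
        private static final FeatureVectorEncoder ENCODER =
                new StaticWordValueEncoder("words"); // hashes words into slots

        // Turns raw text into a hashed bag-of-words feature vector.
        static Vector encode(String text) {
            Vector v = new RandomAccessSparseVector(FEATURES);
            for (String word : text.toLowerCase().split("\\W+")) {
                if (word.length() > 0) {
                    ENCODER.addToVector(word, v);
                }
            }
            return v;
        }

        public static void main(String[] args) {
            // Two categories: 0 = negative, 1 = positive.
            OnlineLogisticRegression learner =
                    new OnlineLogisticRegression(2, FEATURES, new L1());

            // Toy samples; in practice these come from a labeled corpus such
            // as the movie review data mentioned below.
            learner.train(1, encode("a wonderful, moving film"));
            learner.train(0, encode("dull plot and terrible acting"));

            // classifyScalar returns the estimated probability of category 1.
            System.out.println("p(positive) = "
                    + learner.classifyScalar(encode("what a wonderful film")));
        }
    }

classifyScalar is the binary shortcut; classifyFull would return per-category
scores if more than two labels (e.g. neutral) are used.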

In both cases, to turn a document classifier into a sentiment analysis
tool you need a training corpus with sample documents labeled as either
positive or negative (and sometimes neutral). AFAIK, NLTK uses a
movie review corpus to train polarity models for sentiment analysis; I
think it originates from this work (where you can download the raw
data without installing NLTK):

    http://www.cs.cornell.edu/people/pabo/movie-review-data/
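
A minimal sketch of turning that corpus into a trained doccat model,
assuming the dataset unpacks into pos/ and neg/ folders with one review per
file (check the archive layout after download) and assuming the 1.5
training API; the folder names and the output file (the "en-sentiment.bin"
used in the earlier sketch) are illustrative:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerME;
    import opennlp.tools.doccat.DocumentSample;
    import opennlp.tools.util.ObjectStream;

    public class MovieReviewTrainer {

        public static void main(String[] args) throws Exception {
            List<DocumentSample> samples = new ArrayList<DocumentSample>();
            collect(new File("txt_sentoken/pos"), "positive", samples);
            collect(new File("txt_sentoken/neg"), "negative", samples);

            // Train a maxent document categorizer on the labeled samples.
            DoccatModel model = DocumentCategorizerME.train("en", stream(samples));

            // Serialize the model so an enhancement engine can load it later.
            FileOutputStream out = new FileOutputStream("en-sentiment.bin");
            model.serialize(out);
            out.close();
        }

        // Reads each file in dir as one review labeled with the given category.
        private static void collect(File dir, String label,
                List<DocumentSample> samples) throws IOException {
            for (File f : dir.listFiles()) {
                StringBuilder text = new StringBuilder();
                BufferedReader reader = new BufferedReader(new FileReader(f));
                String line;
                while ((line = reader.readLine()) != null) {
                    text.append(line).append(' ');
                }
                reader.close();
                samples.add(new DocumentSample(label, text.toString()));
            }
        }

        // Minimal ObjectStream over an in-memory list of samples.
        private static ObjectStream<DocumentSample> stream(
                final List<DocumentSample> samples) {
            return new ObjectStream<DocumentSample>() {
                private Iterator<DocumentSample> it = samples.iterator();
                public DocumentSample read() {
                    return it.hasNext() ? it.next() : null;
                }
                public void reset() {
                    it = samples.iterator();
                }
                public void close() {
                }
            };
        }
    }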

The movie review data is probably a good start even though it's limited to
the cinema domain and the English language. There is a much larger
multilingual corpus for various product types available here (however, the
format is pre-processed, probably for copyright reasons, so you will need
to write a dedicated vectorizer):

  http://www.webis.de/research/corpora/webis-cls-10

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel