Posted to dev@opennlp.apache.org by Mondher Bouazizi <mo...@gmail.com> on 2016/06/20 15:19:26 UTC

Performances of OpenNLP tools

Hi,

Apologies if you received multiple copies of this email. I sent it to the
users list a while ago, and haven't had an answer yet.

I have been looking for a while to find out whether there is any relevant
work that has tested the OpenNLP tools (in particular the Lemmatizer,
Tokenizer and PoS-Tagger) on short and noisy texts such as Twitter data,
and/or compared them to other libraries.

By performance, I mean accuracy/precision rather than execution time,
etc.
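
For concreteness, the kind of number I am after is what OpenNLP's own
evaluator classes report. Below is a minimal Java sketch (the model path
is a placeholder, the two gold sentences stand in for a real held-out
corpus, and the class names are the opennlp-tools 1.6-era API):

    import java.io.FileInputStream;

    import opennlp.tools.postag.POSEvaluator;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.ObjectStreamUtils;

    public class PosAccuracySketch {
        public static void main(String[] args) throws Exception {
            // Placeholder path: whichever POS model is being evaluated.
            POSTaggerME tagger =
                new POSTaggerME(new POSModel(new FileInputStream("en-pos-maxent.bin")));

            // Gold data in OpenNLP's word_tag format; a real test would stream a
            // held-out tweet or news corpus from disk instead of two literals.
            ObjectStream<POSSample> gold = ObjectStreamUtils.createObjectStream(
                POSSample.parse("The_DT tagger_NN mislabels_VBZ hashtags_NNS ._."),
                POSSample.parse("Short_JJ noisy_JJ texts_NNS are_VBP hard_JJ ._."));

            POSEvaluator evaluator = new POSEvaluator(tagger);
            evaluator.evaluate(gold);
            System.out.println("word accuracy = " + evaluator.getWordAccuracy());
        }
    }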

If anyone can refer me to a paper or other work done in this area, that
would be of great help.

Thank you very much.

Mondher

Re: Performances of OpenNLP tools

Posted by Joern Kottmann <ko...@gmail.com>.
You should get a copy of OntoNotes (it is free), and OpenNLP already has
support for training models on it, so the entry barrier to getting started
with this corpus is very low.
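
As a rough sketch of what that looks like from the Java side: the training
call is just the standard name finder API below, and with OntoNotes you
would replace the toy in-memory stream with the sample stream produced by
the OntoNotes support in the formats package (or use the corresponding
command line trainer). The two training lines are placeholders only:

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderFactory;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.ObjectStreamUtils;
    import opennlp.tools.util.TrainingParameters;

    public class NameFinderTrainingSketch {
        public static void main(String[] args) throws Exception {
            // Toy stand-in for the ObjectStream<NameSample> an OntoNotes reader provides.
            ObjectStream<String> lines = ObjectStreamUtils.createObjectStream(
                "<START:person> Pierre Vinken <END> joined the board .",
                "The board thanked <START:person> Pierre Vinken <END> .");
            ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

            TrainingParameters params = TrainingParameters.defaultParams();
            params.put(TrainingParameters.CUTOFF_PARAM, "1"); // tiny corpus: keep all features

            TokenNameFinderModel model = NameFinderME.train("en", "person", samples,
                params, new TokenNameFinderFactory());
            System.out.println("trained a " + model.getLanguage() + " person model");
        }
    }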

Jörn


Re: Performances of OpenNLP tools

Posted by Anthony Beylerian <an...@gmail.com>.
How about we keep track of the data sets used for performance evaluation,
and the results, in this doc for now:

https://docs.google.com/spreadsheets/d/15c0-u61HNWfQxiDSGjk49M1uBknIfb-LkbP4BDWTB5w/edit?usp=sharing

I will try to take a closer look at OntoNotes and what to use from it.
Otherwise, if anyone would like to suggest suitable data sets for testing
each component, that would be really helpful.

Anthony


Re: Performances of OpenNLP tools

Posted by Joern Kottmann <ko...@gmail.com>.
It would be nice to get MASC support into the OpenNLP formats package.
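
For anyone who wants to pick this up, a reader in the formats package
boils down to a FilterObjectStream that turns corpus input into OpenNLP
sample objects. The sketch below is illustrative only: it parses OpenNLP's
own word_tag lines, which is the trivial part, whereas a real MASC reader
would have to handle MASC's actual annotation files:

    import java.io.IOException;

    import opennlp.tools.postag.POSSample;
    import opennlp.tools.util.FilterObjectStream;
    import opennlp.tools.util.ObjectStream;

    /**
     * Illustrative only: adapts a stream of word_tag lines (one sentence per
     * line) into POSSample objects. A real MASC reader in opennlp.tools.formats
     * would instead parse MASC's own annotation files and could emit token,
     * sentence, POS or name samples in the same way.
     */
    public class WordTagLineSampleStream extends FilterObjectStream<String, POSSample> {

        public WordTagLineSampleStream(ObjectStream<String> lines) {
            super(lines);
        }

        @Override
        public POSSample read() throws IOException {
            String line = samples.read(); // 'samples' is the wrapped stream
            if (line == null) {
                return null;              // end of the underlying stream
            }
            return POSSample.parse(line); // e.g. "The_DT cat_NN sat_VBD ._."
        }
    }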

Jörn


Re: Performances of OpenNLP tools

Posted by Jason Baldridge <ja...@gmail.com>.
Jörn is absolutely right about that. Another good source of training data
is MASC. I've got some instructions for training models with MASC here:

https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial

Chalk (now defunct) provided a Scala wrapper around OpenNLP functionality,
so the instructions there should make it fairly straightforward to adapt
MASC data to OpenNLP.
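
Once the MASC material is converted into OpenNLP's plain training formats,
the training step itself is small. A minimal sketch for the tokenizer (the
two <SPLIT>-annotated lines are placeholders for the converted corpus):

    import java.util.regex.Pattern;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenizerFactory;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.ObjectStreamUtils;
    import opennlp.tools.util.TrainingParameters;

    public class TokenizerTrainingSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder samples in OpenNLP's tokenizer training format:
            // "<SPLIT>" marks token boundaries that are not plain whitespace.
            ObjectStream<TokenSample> samples = ObjectStreamUtils.createObjectStream(
                TokenSample.parse("Good muffins cost $<SPLIT>3.88 in New York<SPLIT>.", "<SPLIT>"),
                TokenSample.parse("Please buy me two of them<SPLIT>, thanks<SPLIT>.", "<SPLIT>"));

            TrainingParameters params = TrainingParameters.defaultParams();
            params.put(TrainingParameters.CUTOFF_PARAM, "1"); // tiny corpus: keep all features

            TokenizerFactory factory =
                new TokenizerFactory("en", null, true, Pattern.compile("^[A-Za-z0-9]+$"));
            TokenizerModel model = TokenizerME.train(samples, factory, params);

            String[] tokens = new TokenizerME(model).tokenize("It cost $4.10, I think.");
            System.out.println(String.join(" | ", tokens));
        }
    }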

-Jason


Re: Performances of OpenNLP tools

Posted by Joern Kottmann <ko...@gmail.com>.
There are some research papers that study and compare the performance of
NLP toolkits, but be careful: they often don't train the NLP tools on the
same data, and the training data makes a big difference to the performance.

Jörn


Re: Performances of OpenNLP tools

Posted by Joern Kottmann <ko...@gmail.com>.
Just don't use the very old existing models; to get good results you have
to train on your own data, especially if the domain of the training data
and the data to be processed don't match. The old models were trained on
1990s news, so they don't work well on today's news and probably much
worse on tweets.

OntoNotes is a good place to start if the goal is to process news. OpenNLP
comes with built-in support to train models from OntoNotes.
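
As a minimal sketch of what training on your own data amounts to for the
POS tagger (the word_tag lines, and the HT tag used for hashtags, are
placeholders for your own annotated sentences, e.g. hand-tagged tweets):

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerFactory;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.ObjectStreamUtils;
    import opennlp.tools.util.TrainingParameters;

    public class PosTaggerTrainingSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder training data in OpenNLP's word_tag format; a real run
            // would stream thousands of annotated sentences from disk instead.
            ObjectStream<POSSample> samples = ObjectStreamUtils.createObjectStream(
                POSSample.parse("#fail_HT this_DT phone_NN keeps_VBZ crashing_VBG ._."),
                POSSample.parse("loving_VBG the_DT new_JJ #opennlp_HT release_NN ._."));

            TrainingParameters params = TrainingParameters.defaultParams();
            params.put(TrainingParameters.CUTOFF_PARAM, "1"); // tiny corpus: keep all features

            POSModel model = POSTaggerME.train("en", samples, params, new POSTaggerFactory());

            String[] sentence = {"#opennlp", "keeps", "improving", "."};
            String[] tags = new POSTaggerME(model).tag(sentence);
            for (int i = 0; i < sentence.length; i++) {
                System.out.println(sentence[i] + " -> " + tags[i]);
            }
        }
    }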

Jörn


Re: Performances of OpenNLP tools

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
This sounds like a fantastic idea.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++










On 6/21/16, 12:13 AM, "Anthony Beylerian" <an...@hotmail.com> wrote:

>+1 
>
>Maybe we could put the results of the evaluator tests for each component somewhere on a webpage and on every release update them.
>This is of course provided there are reasonable data sets for testing each component.
>What do you think?
>
>Anthony
>
>> From: mondher.bouazizi@gmail.com
>> Date: Tue, 21 Jun 2016 15:59:47 +0900
>> Subject: Re: Performances of OpenNLP tools
>> To: dev@opennlp.apache.org
>> 
>> Hi,
>> 
>> Thank you for your replies.
>> 
>> >> Please accept my apologies once more, Jeffrey, for sending the email twice.
>> 
>> I also think it would be great to have such studies on the performances of
>> OpenNLP.
>> 
>> I have been looking for this information and checked in many places,
>> including obviously google scholar, and I haven't found any serious studies
>> or reliable results. Most of the existing ones report the performances of
>> outdated releases of OpenNLP, and focus more on the execution time or
>> CPU/RAM consumption, etc.
>> 
>> I think such a comparison will help not only evaluate the overall accuracy,
>> but also highlight the issues with the existing models (as a matter of
>> fact, the existing models fail to recognize many of the hashtags in tweets:
>> the tokenizer splits them into the "#" symbol and a word that the PoS
>> tagger also fails to recognize).
>> 
>> Therefore, building Twitter-based models would also be useful, since many
>> of the works in academia / industry are focusing on Twitter data.
>> 
>> Best regards,
>> 
>> Mondher
>> 
>> 
>> 
>> On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge <ja...@gmail.com>
>> wrote:
>> 
>> > It would be fantastic to have these numbers. This is an example of
>> > something that would be a great contribution by someone trying to
>> > contribute to open source and who is maybe just getting into machine
>> > learning and natural language processing.
>> >
>> > For Twitter-ish text, it'd be great to look at models trained and evaluated
>> > on the Tweet NLP resources:
>> >
>> > http://www.cs.cmu.edu/~ark/TweetNLP/
>> >
>> > And comparing to how their models performed, etc. Also, it's worth looking
>> > at spaCy (Python NLP modules) for further comparisons.
>> >
>> > https://spacy.io/
>> >
>> > -Jason
>> >
>> > On Mon, 20 Jun 2016 at 10:41 Jeffrey Zemerick <jz...@apache.org>
>> > wrote:
>> >
>> > > I saw the same question on the users list on June 17. At least I thought
>> > it
>> > > was the same question -- sorry if it wasn't.
>> > >
>> > > On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980) <
>> > > chris.a.mattmann@jpl.nasa.gov> wrote:
>> > >
>> > > > Well, hold on. He sent that mail (as of the time of this mail) 4
>> > > > mins previously. Maybe some folks need some time to reply ^_^
>> > > >
>> > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > > > Chris Mattmann, Ph.D.
>> > > > Chief Architect
>> > > > Instrument Software and Science Data Systems Section (398)
>> > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> > > > Office: 168-519, Mailstop: 168-527
>> > > > Email: chris.a.mattmann@nasa.gov
>> > > > WWW:  http://sunset.usc.edu/~mattmann/
>> > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > > > Director, Information Retrieval and Data Science Group (IRDS)
>> > > > Adjunct Associate Professor, Computer Science Department
>> > > > University of Southern California, Los Angeles, CA 90089 USA
>> > > > WWW: http://irds.usc.edu/
>> > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On 6/20/16, 8:23 AM, "Jeffrey Zemerick" <jz...@apache.org> wrote:
>> > > >
>> > > > >Hi Mondher,
>> > > > >
>> > > > >Since you didn't get any replies I'm guessing no one is aware of any
>> > > > >resources related to what you need. Google Scholar is a good place to
>> > > look
>> > > > >for papers referencing OpenNLP and its methods (in case you haven't
>> > > > >searched it already).
>> > > > >
>> > > > >Jeff
>> > > > >
>> > > > >On Mon, Jun 20, 2016 at 11:19 AM, Mondher Bouazizi <
>> > > > >mondher.bouazizi@gmail.com> wrote:
>> > > > >
>> > > > >> Hi,
>> > > > >>
>> > > > >> Apologies if you received multiple copies of this email. I sent it
>> > to
>> > > > the
>> > > > >> users list a while ago, and haven't had an answer yet.
>> > > > >>
>> > > > >> I have been looking for a while if there is any relevant work that
>> > > > >> performed tests on the OpenNLP tools (in particular the Lemmatizer,
>> > > > >> Tokenizer and PoS-Tagger) when used with short and noisy texts such
>> > as
>> > > > >> Twitter data, etc., and/or compared it to other libraries.
>> > > > >>
>> > > > >> By performances, I mean accuracy/precision, rather than time of
>> > > > execution,
>> > > > >> etc.
>> > > > >>
>> > > > >> If anyone can refer me to a paper or a work done in this context,
>> > that
>> > > > >> would be of great help.
>> > > > >>
>> > > > >> Thank you very much.
>> > > > >>
>> > > > >> Mondher
>> > > > >>
>> > > >
>> > >
>> >
> 		 	   		  

RE: Performances of OpenNLP tools

Posted by Anthony Beylerian <an...@hotmail.com>.
+1 

Maybe we could publish the results of the evaluator tests for each component on a webpage and update them with every release.
This of course assumes there are reasonable data sets for testing each component (a rough sketch of what I mean follows below).
What do you think?
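
Concretely, I mean the kind of per-component numbers we could recompute automatically at release time. A minimal sketch in Java, assuming a pre-trained PoS model and a held-out test file in OpenNLP's one-sentence-per-line word_TAG format (the file names here are just placeholders):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class PosAccuracyCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder file names: a pre-trained PoS model and a held-out test
        // file with one sentence per line and tokens in word_TAG form.
        POSTaggerME tagger =
                new POSTaggerME(new POSModel(new FileInputStream("en-pos-maxent.bin")));

        int correct = 0;
        int total = 0;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("en-pos.test"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                // Assumes every token is well-formed word_TAG.
                String[] pairs = line.trim().split("\\s+");
                String[] words = new String[pairs.length];
                String[] gold = new String[pairs.length];
                for (int i = 0; i < pairs.length; i++) {
                    int sep = pairs[i].lastIndexOf('_');
                    words[i] = pairs[i].substring(0, sep);
                    gold[i] = pairs[i].substring(sep + 1);
                }
                String[] predicted = tagger.tag(words);
                for (int i = 0; i < words.length; i++) {
                    if (predicted[i].equals(gold[i])) {
                        correct++;
                    }
                    total++;
                }
            }
        }
        System.out.printf("PoS accuracy: %.4f (%d/%d)%n",
                (double) correct / total, correct, total);
    }
}

If I remember correctly, the command line also exposes evaluator tools (e.g. POSTaggerEvaluator and TokenizerMEEvaluator), so the same numbers could probably be produced by a small script on each release.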

Anthony


Re: Performances of OpenNLP tools

Posted by Mondher Bouazizi <mo...@gmail.com>.
Hi,

Thank you for your replies.

Jeffrey, please accept my apologies once more for sending you the email twice.

I also think it would be great to have such studies on the performance of
OpenNLP.

I have been looking for this information in many places, including, of course,
Google Scholar, and I haven't found any serious studies or reliable results.
Most of the existing ones report the performance of outdated releases of
OpenNLP, and focus more on execution time or CPU/RAM consumption.

I think such a comparison would help not only to evaluate the overall accuracy,
but also to highlight issues with the existing models. As a matter of fact, the
existing models fail to handle many of the hashtags in tweets: the tokenizer
splits a hashtag into the "#" symbol and a word that the PoS tagger then also
fails to tag correctly.
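
A quick way to reproduce this is something along the following lines (a sketch only; en-token.bin and en-pos-maxent.bin are the standard pre-trained English models, and the tweet is an invented example):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class HashtagCheck {
    public static void main(String[] args) throws Exception {
        try (InputStream tokIn = new FileInputStream("en-token.bin");
             InputStream posIn = new FileInputStream("en-pos-maxent.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
            POSTaggerME tagger = new POSTaggerME(new POSModel(posIn));

            String tweet = "Loving the new #OpenNLP release, great job guys!";

            // The pre-trained tokenizer splits "#OpenNLP" into "#" and "OpenNLP",
            // and the tagger then has to tag the two pieces separately.
            String[] tokens = tokenizer.tokenize(tweet);
            String[] tags = tagger.tag(tokens);

            System.out.println(Arrays.toString(tokens));
            System.out.println(Arrays.toString(tags));
        }
    }
}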

Therefore, building Twitter-specific models would also be useful, since much
of the work in academia and industry focuses on Twitter data.

Best regards,

Mondher

Re: Performances of OpenNLP tools

Posted by Jason Baldridge <ja...@gmail.com>.
It would be fantastic to have these numbers. This is exactly the kind of
contribution that would suit someone looking to get into open source who is
maybe just getting started with machine learning and natural language
processing.

For Twitter-ish text, it'd be great to look at models trained and evaluated
on the Tweet NLP resources:

http://www.cs.cmu.edu/~ark/TweetNLP/
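
Getting that annotated data into OpenNLP's PoS format should mostly be a matter of conversion. A rough sketch, assuming (if I remember the layout right) one token<TAB>tag pair per line with blank lines between tweets, and with placeholder file names:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TweetNlpToOpenNlp {
    public static void main(String[] args) throws Exception {
        // Placeholder file names. Input: one token<TAB>tag pair per line,
        // blank lines between tweets. Output: one tweet per line with
        // word_TAG tokens, the layout the OpenNLP PoS tools expect.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream("tweets-annotated.txt"), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(new OutputStreamWriter(
                     new FileOutputStream("tweets.pos"), StandardCharsets.UTF_8))) {
            List<String> tweet = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    if (!tweet.isEmpty()) {
                        out.println(String.join(" ", tweet));
                        tweet.clear();
                    }
                } else {
                    // Keep only the first two columns (token and tag); tokens that
                    // themselves contain "_" would need extra handling.
                    String[] cols = line.split("\t");
                    tweet.add(cols[0] + "_" + cols[1]);
                }
            }
            if (!tweet.isEmpty()) {
                out.println(String.join(" ", tweet));
            }
        }
    }
}

From there the usual OpenNLP training and evaluation tools could be run on the converted file.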

It would also be worth comparing against how their models performed, and
looking at spaCy (a Python NLP library) for further comparisons:

https://spacy.io/

-Jason


Re: Performances of OpenNLP tools

Posted by Jeffrey Zemerick <jz...@apache.org>.
I saw the same question on the users list on June 17. At least I thought it
was the same question -- sorry if it wasn't.


Re: Performances of OpenNLP tools

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Well, hold on. He sent that mail (as of the time of this mail) 4
mins previously. Maybe some folks need some time to reply ^_^

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: Performances of OpenNLP tools

Posted by Jeffrey Zemerick <jz...@apache.org>.
Hi Mondher,

Since you didn't get any replies I'm guessing no one is aware of any
resources related to what you need. Google Scholar is a good place to look
for papers referencing OpenNLP and its methods (in case you haven't
searched it already).

Jeff
