Posted to dev@ctakes.apache.org by "Chen, Pei" <Pe...@childrens.harvard.edu> on 2013/04/08 18:15:10 UTC

ClearNLP POSTagger

Hi,
While working on the Dependency Parser/SRL labeler, we also have a POS tagger from ClearNLP. It is fairly simple, and I have the code ready to be checked in (it was trained on the same data as the dependency parser: MiPACQ/SHARP). What do folks think?
We can include both Analysis Engines in the ctakes-pos-tagger project, but should we leave the current OpenNLP tagger in the default pipeline or default to the latest?

"The ClearNLP POS tagger shows more robust results on unknown words by generalizing lexical features.  You can find the reference from this paper.
Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection, Jinho D. Choi, Martha Palmer, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL'12), 363-367, Jeju, Korea, 2012. [1] It also uses AdaGrad for machine learning, which is a more advanced learning algorithm than maximum entropy used by OpenNLP."

[1] http://aclweb.org/anthology-new/P/P12/P12-2071.pdf

Re: ClearNLP POSTagger

Posted by Jörn Kottmann <ko...@gmail.com>.

On 04/09/2013 10:42 PM, Chen, Pei wrote:
> Let me know if you get a chance to try it out/run some benchmarks see how it performs against the current.

The OpenNLP POS Tagger has built-in evaluation. If you have test data,
you could run the evaluator on it; if you only have training data,
you could run the cross-validation evaluator instead.

Jörn

RE: ClearNLP POSTagger

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

Thanks, James.
Good idea. It has been moved to a clearnlp folder now (to indicate that it is a ClearNLP model):
org/apache/ctakes/postagger/models/clearnlp/mayo-en-pos-1.3.0.jar

Let me know if you get a chance to try it out or run some benchmarks to see how it performs against the current tagger.

--Pei

> -----Original Message-----
> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
> Sent: Tuesday, April 09, 2013 4:31 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: ClearNLP POSTagger
> 
> That's great. Thanks.
> 
> Is there something that describes which model to use for which AE.
> Or maybe put something in the model filename, or put the model in a
> separate subdirectory?
> 
> -- James

RE: ClearNLP POSTagger

Posted by "Masanz, James J." <Ma...@mayo.edu>.

That's great. Thanks.

Is there something that describes which model to use for which AE?
Or maybe we could put something in the model filename, or put the model in a separate subdirectory?

-- James


> -----Original Message-----
> From: dev-return-1482-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-
> return-1482-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Chen,
> Pei
> Sent: Tuesday, April 09, 2013 3:29 PM
> To: dev@ctakes.apache.org
> Subject: RE: ClearNLP POSTagger
> 
> FYI:
> This has been done in trunk in r. 1466216
> https://issues.apache.org/jira/browse/CTAKES-186
> If you would like to try it out or run some benchmarks before we decide if
> we should make the default pipeline use this, just uncomment the below in
> your Aggregate Descriptors.
> 
> <delegateAnalysisEngine key="ClearPOSTagger">
>   <import location="../../../ctakes-pos-tagger/desc/ClearNLPPOSTagger.xml"/>
> </delegateAnalysisEngine>
> <node>ClearPOSTagger</node>

RE: ClearNLP POSTagger

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

FYI:
This has been committed to trunk in r1466216:
https://issues.apache.org/jira/browse/CTAKES-186
If you would like to try it out or run some benchmarks before we decide whether to make it the default in the pipeline, just uncomment the following in your aggregate descriptors.

<delegateAnalysisEngine key="ClearPOSTagger">
  <import location="../../../ctakes-pos-tagger/desc/ClearNLPPOSTagger.xml"/>
</delegateAnalysisEngine>
<node>ClearPOSTagger</node>
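
As a rough sketch of where those two fragments sit: a UIMA aggregate descriptor declares each delegate under delegateAnalysisEngineSpecifiers and then lists the delegate's key in the fixed flow. The descriptor name and the comments below are placeholders for illustration and are not taken from cTAKES trunk; only the key and the import location come from the snippet above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <!-- ... other delegates of the pipeline ... -->
    <delegateAnalysisEngine key="ClearPOSTagger">
      <import location="../../../ctakes-pos-tagger/desc/ClearNLPPOSTagger.xml"/>
    </delegateAnalysisEngine>
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <!-- Placeholder name, not an actual cTAKES descriptor -->
    <name>ExampleAggregate</name>
    <flowConstraints>
      <fixedFlow>
        <!-- ... nodes that must run before POS tagging ... -->
        <node>ClearPOSTagger</node>
        <!-- ... nodes that consume POS tags ... -->
      </fixedFlow>
    </flowConstraints>
  </analysisEngineMetaData>
</analysisEngineDescription>
```

The import location is resolved relative to the descriptor file that contains it, so the path may need adjusting if the aggregate lives at a different directory depth.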

> -----Original Message-----
> From: Chen, Pei [mailto:Pei.Chen@childrens.harvard.edu]
> Sent: Monday, April 08, 2013 5:14 PM
> To: dev@ctakes.apache.org
> Subject: RE: ClearNLP POSTagger
> 
> Hi Richard,
> Yes- the ClearNLP tools (POSTagger, Dependency Parser, SRL) in cTAKES
> were retrained with additional data (MiPAQ/SHARP).
> The Dependency Parser/SRL replaced the existing one because the old
> ClearParser ones were no longer supported.
> 
> The ClearPOSTagger wasn't previously available in cTAKES, but we can
> certainly make it an optional one in case some folks may want to use it.  I'll
> leave the default one (OpenNLP) as-is for the time being until we get some
> more users/tests/benchmarks/feedback...
> 
> --Pei

Re: ClearNLP POSTagger

Posted by Richard Eckart de Castilho <ec...@ukp.informatik.tu-darmstadt.de>.

Am 09.04.2013 um 03:29 schrieb "Chen, Pei" <Pe...@childrens.harvard.edu>:

> Hi Richard,
> It is so useful that someone is maintaining these as Maven artifacts.  Do you have the artifact names/ids to the corresponding models?  I'm thinking to reuse those from Maven central if they already exists.


I know of some models that are hosted on Maven Central:

- Provided by Stanford for the Stanford tools

	groupId: 	edu.stanford.nlp
	artifactIds:	stanford-parser (classifier: models)
			stanford-corenlp (classifier: models)

- Provided by Washington (for ClearNLP, Stanford tools)

	groupId:	edu.washington.cs.knowitall.*
	artifactIds:	*-models

So far, I have not found these very useful because of packaging mistakes
and incompleteness. That is why we still host our own public model
repository at:

  https://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/public-model-releases-local

We hesitate to push the models from there to Maven Central because
the licensing of the models is quite unclear, even for those that
ship with tools that have a clear license, such as the Stanford tools.
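
For reference, pulling one of the Maven Central model artifacts listed above into a build could look roughly like this. The groupId, artifactId and classifier are the ones named above; the version property is a placeholder, since no version is given here.

```xml
<dependencies>
  <!-- The tool itself -->
  <dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <!-- Placeholder property; define it to the release you actually use -->
    <version>${stanford.corenlp.version}</version>
  </dependency>
  <!-- Same coordinates plus the "models" classifier selects the model JAR -->
  <dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>${stanford.corenlp.version}</version>
    <classifier>models</classifier>
  </dependency>
</dependencies>
```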

We are considering pushing proxy artifacts to Maven Central that can be
used as dependencies and that declare a reference to our public
repository and to the model artifact hosted there. The packaging
of the models is currently being revised, and the new versions
will soon be pushed to our public repository for testing first, and
later to Maven Central. Test stagings of the new packaging
can at times be found here:

  https://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/public-ukp-staging-local
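
A proxy artifact of the kind described above might be little more than a POM like the following sketch. All coordinates, the repository id and the versions are hypothetical; only the repository URL is taken from this message. Whether Maven Central accepts POMs that declare external repositories would need to be checked.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <!-- Hypothetical coordinates of the proxy artifact -->
  <groupId>org.example.models</groupId>
  <artifactId>example-model-proxy</artifactId>
  <version>0.0.1</version>
  <packaging>pom</packaging>

  <!-- Tells Maven where the real model artifact is hosted -->
  <repositories>
    <repository>
      <id>ukp-public-models</id> <!-- hypothetical id -->
      <url>https://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/public-model-releases-local</url>
    </repository>
  </repositories>

  <!-- Depending on the proxy transitively pulls in the model -->
  <dependencies>
    <dependency>
      <groupId>org.example.models</groupId>        <!-- hypothetical -->
      <artifactId>example-model</artifactId>       <!-- hypothetical -->
      <version>0.0.1</version>
    </dependency>
  </dependencies>
</project>
```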

Cheers,

-- Richard

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab (UKP-TUD) 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
eckart@ukp.informatik.tu-darmstadt.de 
www.ukp.tu-darmstadt.de 
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------

RE: ClearNLP POSTagger

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

Hi Richard,
It is so useful that someone is maintaining these as Maven artifacts.  Do you have the artifact names/IDs of the corresponding models?  I'm thinking of reusing those from Maven Central if they already exist.

--Pei

________________________________________
From: Richard Eckart de Castilho [eckart@ukp.informatik.tu-darmstadt.de]
Sent: Monday, April 08, 2013 5:28 PM
To: <de...@ctakes.apache.org>
Subject: Re: ClearNLP POSTagger

Hi Pei,

> Yes- the ClearNLP tools (POSTagger, Dependency Parser, SRL) in cTAKES were retrained with additional data (MiPAQ/SHARP).
> The Dependency Parser/SRL replaced the existing one because the old ClearParser ones were no longer supported.

I'm only interested in the models - basically any models actually. We have the ClearNLP stuff also integrated in DKPro Core and we collect models for everything and redistribute them via our public Maven server (unless explicitly forbidden). If you have models for any tools that we may (or may not yet) have in DKPro Core, I'd be happy to package them up as Maven artifacts. We already have a good collection of stuff hosted there. Here is a slightly outdated overview that may give an impression: http://bit.ly/UBjusE

Cheers,

-- Richard


RE: ClearNLP POSTagger

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

Hi Richard,
Check out the 1.3.0 pre-trained models below, courtesy of Jinho and co.:
https://code.google.com/p/clearnlp/wiki/TrainedModels

--Pei

> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:eckart@ukp.informatik.tu-
> darmstadt.de]
> Sent: Monday, April 08, 2013 5:29 PM
> To: <de...@ctakes.apache.org>
> Subject: Re: ClearNLP POSTagger
> 
> Hi Pei,
> 
> > Yes- the ClearNLP tools (POSTagger, Dependency Parser, SRL) in cTAKES
> were retrained with additional data (MiPAQ/SHARP).
> > The Dependency Parser/SRL replaced the existing one because the old
> ClearParser ones were no longer supported.
> 
> I'm only interested in the models - basically any models actually. We have the
> ClearNLP stuff also integrated in DKPro Core and we collect models for
> everything and redistribute them via our public Maven server (unless
> explicitly forbidden). If you have models for any tools that we may (or may
> not yet) have in DKPro Core, I'd be happy to package them up as Maven
> artifacts. We already have a good collection of stuff hosted there. Here is a
> slightly outdated overview that may give an impression: http://bit.ly/UBjusE
> 
> Cheers,
> 
> -- Richard

Re: ClearNLP POSTagger

Posted by Richard Eckart de Castilho <ec...@ukp.informatik.tu-darmstadt.de>.

Hi Pei,

> Yes- the ClearNLP tools (POSTagger, Dependency Parser, SRL) in cTAKES were retrained with additional data (MiPAQ/SHARP).  
> The Dependency Parser/SRL replaced the existing one because the old ClearParser ones were no longer supported.

I'm only interested in the models - basically any models actually. We have the ClearNLP stuff also integrated in DKPro Core and we collect models for everything and redistribute them via our public Maven server (unless explicitly forbidden). If you have models for any tools that we may (or may not yet) have in DKPro Core, I'd be happy to package them up as Maven artifacts. We already have a good collection of stuff hosted there. Here is a slightly outdated overview that may give an impression: http://bit.ly/UBjusE

Cheers,

-- Richard


RE: ClearNLP POSTagger

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

Hi Richard,
Yes, the ClearNLP tools (POSTagger, Dependency Parser, SRL) in cTAKES were retrained with additional data (MiPACQ/SHARP).
The Dependency Parser/SRL replaced the existing ones because the old ClearParser versions were no longer supported.

The ClearPOSTagger wasn't previously available in cTAKES, but we can certainly make it an optional one in case some folks may want to use it.  I'll leave the default one (OpenNLP) as-is for the time being until we get some more users/tests/benchmarks/feedback...

--Pei

> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:eckart@ukp.informatik.tu-
> darmstadt.de]
> Sent: Monday, April 08, 2013 1:43 PM
> To: <de...@ctakes.apache.org>
> Subject: Re: ClearNLP POSTagger
> 
> Hi,
> 
> did you train new models for the ClearNLP/OpenNLP tools? (Maybe I knew if
> I had followed a past discussion on models more closely.)
> 
> Cheers,
> 
> -- Richard

Re: ClearNLP POSTagger

Posted by Richard Eckart de Castilho <ec...@ukp.informatik.tu-darmstadt.de>.

Hi,

Did you train new models for the ClearNLP/OpenNLP tools? (Maybe I would know if I had followed a past discussion on models more closely…)

Cheers,

-- Richard

Am 08.04.2013 um 18:15 schrieb "Chen, Pei" <Pe...@childrens.harvard.edu>:

> Hi,
> While working on the Dependency Parser/SRL labeler,  we also have a POSTagger from ClearNLP.  It is fairly simple and I have the code ready (also trained on the same data as the dep parser- MiPaq/SHARP) to be checked-in.  What does the folks think:
> We can include both Analysis Engines in the ctakes-pos-tagger project.  But should we leave the current OpenNLP in the default pipeline or default to the latest?
> 
> "The ClearNLP POS tagger shows more robust results on unknown words by generalizing lexical features.  You can find the reference from this paper.
> Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection, Jinho D. Choi, Martha Palmer, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL'12), 363-367, Jeju, Korea, 2012. [1] It also uses AdaGrad for machine learning, which is a more advanced learning algorithm than maximum entropy used by OpenNLP."
> 
> [1] http://aclweb.org/anthology-new/P/P12/P12-2071.pdf



Re: ClearNLP POSTagger

Posted by Jörn Kottmann <ko...@gmail.com>.

Would it be possible to run some benchmarks so we know the performance
difference between the two?

The OpenNLP POS Tagger can be customized. It is currently possible to
replace the feature generation, so it can probably be optimized for the
medical domain; the default feature generation is tuned for the news domain.
Replacing the learning algorithm is not possible at the moment, but we will
work on that for the next release.

Do you use a tag dictionary? Maybe it is possible to generate one
from the existing dictionaries already used by cTAKES.
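
To illustrate the tag dictionary idea: as far as I know, OpenNLP reads POS tag dictionaries from an XML file in which each entry lists a token and the tags allowed for it. The entries below are invented examples, not derived from any cTAKES dictionary, and the exact format should be checked against the OpenNLP version in use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical entries for illustration only -->
<dictionary case_sensitive="false">
  <!-- A token that should only ever be tagged as a noun -->
  <entry tags="NN">
    <token>aspirin</token>
  </entry>
  <!-- A token with more than one permitted tag -->
  <entry tags="NN VB">
    <token>stent</token>
  </entry>
</dictionary>
```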

Jörn

On 04/08/2013 06:15 PM, Chen, Pei wrote:
> Hi,
> While working on the Dependency Parser/SRL labeler,  we also have a POSTagger from ClearNLP.  It is fairly simple and I have the code ready (also trained on the same data as the dep parser- MiPaq/SHARP) to be checked-in.  What does the folks think:
> We can include both Analysis Engines in the ctakes-pos-tagger project.  But should we leave the current OpenNLP in the default pipeline or default to the latest?
>
> "The ClearNLP POS tagger shows more robust results on unknown words by generalizing lexical features.  You can find the reference from this paper.
> Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection, Jinho D. Choi, Martha Palmer, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL'12), 363-367, Jeju, Korea, 2012. [1] It also uses AdaGrad for machine learning, which is a more advanced learning algorithm than maximum entropy used by OpenNLP."
>
> [1] http://aclweb.org/anthology-new/P/P12/P12-2071.pdf
>

RE: ClearNLP POSTagger

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

Okay, 
I'll commit the ClearPOSTagger and make it available in the ctakes-pos-tagger component, but leave everything else as it currently is (still defaulting to OpenNLP).
We can always switch from one to the other in the future, once there is a fair comparison/benchmark.

Note: I think the ClearPOSTagger is also significantly faster.

> -----Original Message-----
> From: Lee Becker [mailto:lee.becker@gmail.com]
> Sent: Monday, April 08, 2013 2:29 PM
> To: dev@ctakes.apache.org
> Subject: Re: ClearNLP POSTagger
> 
> On Mon, Apr 8, 2013 at 12:04 PM, Steven Bethard
> <steven.bethard@colorado.edu
> > wrote:
> 
> > > While working on the Dependency Parser/SRL labeler,  we also have a
> > POSTagger from ClearNLP.  It is fairly simple and I have the code
> > ready (also trained on the same data as the dep parser- MiPaq/SHARP)
> > to be checked-in.  What does the folks think:
> > > We can include both Analysis Engines in the ctakes-pos-tagger project.
> >  But should we leave the current OpenNLP in the default pipeline or
> > default to the latest?
> >
> > My vote would be to default for whatever has the best performance.
> > Presumably the ClearNLP one?
> >
> > > "The ClearNLP POS tagger shows more robust results on unknown words
> > > by
> > generalizing lexical features.
> >
> > Looking at the paper, ClearNLP POS tagger is not compared directly to
> > the cTAKES OpenNLP POS tagger, but they do outperform the Stanford
> > tagger trained on the same data, so it's probably a reasonable guess
> > that they're more accurate than the OpenNLP tagger.
> >
> > > It also uses AdaGrad for machine learning, which is a more advanced
> > learning algorithm than maximum entropy used by OpenNLP."
> >
> > My opinion is that we should never include a model in cTAKES just
> > because it has a "more advanced learning algorithm". "More advanced
> > learning algorithm" does not always translate into better performance.
> 
> 
> If my memory is serving me correctly, I think Jinho trained his parsers off of
> predicted POS tags to get eke out the extra performance.  The takeaway
> being that ClearNLP does better when you can use as much of its pipeline as
> possible.

Re: ClearNLP POSTagger

Posted by Lee Becker <le...@gmail.com>.

On Mon, Apr 8, 2013 at 12:04 PM, Steven Bethard <steven.bethard@colorado.edu
> wrote:

> > While working on the Dependency Parser/SRL labeler,  we also have a
> POSTagger from ClearNLP.  It is fairly simple and I have the code ready
> (also trained on the same data as the dep parser- MiPaq/SHARP) to be
> checked-in.  What does the folks think:
> > We can include both Analysis Engines in the ctakes-pos-tagger project.
>  But should we leave the current OpenNLP in the default pipeline or default
> to the latest?
>
> My vote would be to default for whatever has the best performance.
> Presumably the ClearNLP one?
>
> > "The ClearNLP POS tagger shows more robust results on unknown words by
> generalizing lexical features.
>
> Looking at the paper, ClearNLP POS tagger is not compared directly to the
> cTAKES OpenNLP POS tagger, but they do outperform the Stanford tagger
> trained on the same data, so it's probably a reasonable guess that they're
> more accurate than the OpenNLP tagger.
>
> > It also uses AdaGrad for machine learning, which is a more advanced
> learning algorithm than maximum entropy used by OpenNLP."
>
> My opinion is that we should never include a model in cTAKES just because
> it has a "more advanced learning algorithm". "More advanced learning
> algorithm" does not always translate into better performance.


If my memory serves me correctly, Jinho trained his parsers on
predicted POS tags to eke out extra performance.  The takeaway
is that ClearNLP does better when you can use as much of its pipeline as
possible.

Re: ClearNLP POSTagger

Posted by Steven Bethard <st...@Colorado.EDU>.

On Apr 8, 2013, at 10:15 AM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
> While working on the Dependency Parser/SRL labeler,  we also have a POSTagger from ClearNLP.  It is fairly simple and I have the code ready (also trained on the same data as the dep parser- MiPaq/SHARP) to be checked-in.  What does the folks think:
> We can include both Analysis Engines in the ctakes-pos-tagger project.  But should we leave the current OpenNLP in the default pipeline or default to the latest?

My vote would be to default to whatever has the best performance. Presumably the ClearNLP one?

> "The ClearNLP POS tagger shows more robust results on unknown words by generalizing lexical features.

Looking at the paper, ClearNLP POS tagger is not compared directly to the cTAKES OpenNLP POS tagger, but they do outperform the Stanford tagger trained on the same data, so it's probably a reasonable guess that they're more accurate than the OpenNLP tagger.

> It also uses AdaGrad for machine learning, which is a more advanced learning algorithm than maximum entropy used by OpenNLP."

My opinion is that we should never include a model in cTAKES just because it has a "more advanced learning algorithm". "More advanced learning algorithm" does not always translate into better performance.

Steve