Posted to dev@opennlp.apache.org by James Kosin <ja...@gmail.com> on 2011/11/17 05:50:46 UTC

JWNL

All,

I just saw this, which may be interesting to update to...
    http://sourceforge.net/projects/extjwnl/

James

Re: Coref problem

Posted by Jörn Kottmann <ko...@gmail.com>.
Here is a link to the Cas Editor documentation:
http://uima.apache.org/d/uimaj-2.3.1/tools.html#ugr.tools.ce

Jörn

On 11/17/11 12:43 PM, Jörn Kottmann wrote:
> On 11/17/11 11:59 AM, Aliaksandr Autayeu wrote:
>> On Thu, Nov 17, 2011 at 11:48 AM, Jörn Kottmann<ko...@gmail.com>  
>> wrote:
>>
>>> On 11/17/11 11:32 AM, Aliaksandr Autayeu wrote:
>>>
>>>> We shouldn't replace JWNL with a newer version,
>>>>> because we currently don't have the ability to train
>>>>> or evaluate the coref component.
>>>>>
>>>>>   +1. Having tests coverage eases many things, refactoring and 
>>>>> development
>>>> included :)
>>>>
>>>> This is a big issue for us because that also blocks
>>>>
>>>>> other changes and updates to the code itself,
>>>>> e.g. the cleanups Aliaksandr contributed.
>>>>>
>>>>> What we need here is a plan how we can get the coref component
>>>>> into a state which makes it possible to develop it in a community.
>>>>>
>>>>> If we don't find a way to resolve this I think we should move the 
>>>>> coref
>>>>> stuff
>>>>> to the sandbox and leave it there until we have some training data.
>>>>>
>>>>>   In my experience doing things like this is almost equal to 
>>>>> deleting the
>>>> piece of code altogether. On the other side, if there is no developer,
>>>> actively using and developing this piece, having corpora, tests, etc,
>>>> others might not have enough incentives.
>>>>
>>> That is already the situation the developer who wrote doesn't 
>>> support it
>>> anymore.
>>> The only way to get it alive again would be to get the training and
>>> evaluation running.
>>> If we have that, it will be possible to continue to work on it, and 
>>> people
>>> can start using
>>> it. The code itself is easy to understand and I have a good idea of 
>>> how it
>>> works.
>>>
>>> In the current state it really blocks the development of a few things.
>>>
>>>
>>>> Another option would be label enough wikinews data, so we are able to
>>>> train it.
>>>>
>>>> How much exactly is this "enough"? And what's the annotation UI? 
>>>> This also
>>>> might be a good option to improve the annotation tools. I might be
>>>> interested in pursuing this option (only if the corpus produced 
>>>> will be
>>>> under a free license), mainly to learn :) but I would need some 
>>>> help and
>>>> supervision.
>>>>
>>> We are discussing to do a wikinews crowd sourcing project to label
>>> training data for all components in OpenNLP.
>>>
>>> I once wrote a proposal to communicate this idea:
>>> https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
>>>
>>>
>>> Currently we have a first version of the Corpus Server and plugins
>>> for the UIMA Cas Editor (an annotation tool) to access articles in the
>>> Corpus Server and
>>> an OpenNLP Plugin which can help with doing sentence detection,
>>> tokenization and NER (could be extended with support for coref).
>>>
>>> These tools are all located in the sandbox.
>>>
>>> I am currently using them to run a private annotation project, and
>>> therefore have time to work on them.
>> I'll get a look at them. I also have my own annotation tools, because I
>> wasn't happy with what was available out there few years ago and 
>> because of
>> some specifics of the situation which can be exploited to speed up the
>> annotation, but I would be happy to avoid duplication.
>>
>>
>
> Are your own tools also Open Source? The Cas Editor itself is often 
> criticized to
> not fit the needs of a particular annotation project, but it can 
> easily be extended
> by a plugin which adds a new eclipse view to show just the information 
> you need.
> I did this a lot for a few very specific things.
>
> I think UIMA is a great platform for annotation tooling since the UIMA 
> CAS (a data structure
> which can be used to contain text and annotations) gives you many 
> features you need
> to make such a tool and is easy to adapt to new use cases, e.g. 
> defining a new feature structure
> type.
>
> OpenNLP already has training support for UIMA, which I use to train 
> new models, these are then
> placed on a http server and the OpenNLP Cas Editor Plugin can load 
> models via http.
> With this setup you have a closed learning loop and training can be 
> done every few minutes.
>
> Back to the coref component, I had a look at extjwnl, one of the 
> issues I noticed with WordNet is
> that there are so many different versions and formats for different 
> languages, which makes it
> hard to integrate them into coref (which should one day be able to 
> support other languages as well).
> I always though we might need to define our own WordNet data format, 
> so we can easily handle
> WordNets for different languages.
>
> I saw that you worked on this library, maybe that could be something 
> we can move to OpenNLP
> or base some new work on.
>
> Another issue is that we have a zip package which contains all 
> resources loaded into a component,
> but it looks like that this is not so easy with the current WordNet 
> directory.
>
> Jörn


Re: Coref problem

Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/17/11 11:59 AM, Aliaksandr Autayeu wrote:
> On Thu, Nov 17, 2011 at 11:48 AM, Jörn Kottmann<ko...@gmail.com>  wrote:
>
>> On 11/17/11 11:32 AM, Aliaksandr Autayeu wrote:
>>
>>> We shouldn't replace JWNL with a newer version,
>>>> because we currently don't have the ability to train
>>>> or evaluate the coref component.
>>>>
>>>>   +1. Having tests coverage eases many things, refactoring and development
>>> included :)
>>>
>>> This is a big issue for us because that also blocks
>>>
>>>> other changes and updates to the code itself,
>>>> e.g. the cleanups Aliaksandr contributed.
>>>>
>>>> What we need here is a plan how we can get the coref component
>>>> into a state which makes it possible to develop it in a community.
>>>>
>>>> If we don't find a way to resolve this I think we should move the coref
>>>> stuff
>>>> to the sandbox and leave it there until we have some training data.
>>>>
>>>>   In my experience doing things like this is almost equal to deleting the
>>> piece of code altogether. On the other side, if there is no developer,
>>> actively using and developing this piece, having corpora, tests, etc,
>>> others might not have enough incentives.
>>>
>> That is already the situation the developer who wrote doesn't support it
>> anymore.
>> The only way to get it alive again would be to get the training and
>> evaluation running.
>> If we have that, it will be possible to continue to work on it, and people
>> can start using
>> it. The code itself is easy to understand and I have a good idea of how it
>> works.
>>
>> In the current state it really blocks the development of a few things.
>>
>>
>>> Another option would be label enough wikinews data, so we are able to
>>> train it.
>>>
>>> How much exactly is this "enough"? And what's the annotation UI? This also
>>> might be a good option to improve the annotation tools. I might be
>>> interested in pursuing this option (only if the corpus produced will be
>>> under a free license), mainly to learn :) but I would need some help and
>>> supervision.
>>>
>> We are discussing to do a wikinews crowd sourcing project to label
>> training data for all components in OpenNLP.
>>
>> I once wrote a proposal to communicate this idea:
>> https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
>>
>> Currently we have a first version of the Corpus Server and plugins
>> for the UIMA Cas Editor (an annotation tool) to access articles in the
>> Corpus Server and
>> an OpenNLP Plugin which can help with doing sentence detection,
>> tokenization and NER (could be extended with support for coref).
>>
>> These tools are all located in the sandbox.
>>
>> I am currently using them to run a private annotation project, and
>> therefore have time to work on them.
> I'll get a look at them. I also have my own annotation tools, because I
> wasn't happy with what was available out there few years ago and because of
> some specifics of the situation which can be exploited to speed up the
> annotation, but I would be happy to avoid duplication.
>
>

Are your own tools also Open Source? The Cas Editor itself is often
criticized for not fitting the needs of a particular annotation
project, but it can easily be extended by a plugin which adds a new
Eclipse view to show just the information you need. I did this a lot
for a few very specific things.

I think UIMA is a great platform for annotation tooling, since the
UIMA CAS (a data structure which can contain text and annotations)
gives you many of the features you need to build such a tool, and it
is easy to adapt to new use cases, e.g. by defining a new feature
structure type.

OpenNLP already has training support for UIMA, which I use to train
new models; these are then placed on an HTTP server, and the OpenNLP
Cas Editor Plugin can load models via HTTP. With this setup you have a
closed learning loop, and training can be done every few minutes.
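For illustration, the fetch step of such a loop needs nothing beyond
the JDK; here is a minimal sketch of downloading a freshly trained
model over HTTP (the URL and class name are hypothetical, not the
actual plugin code):

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: fetch a freshly trained model file over HTTP so an
// annotation tool can pick up new models without a restart.
public final class ModelFetcher {

    public static byte[] fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status: " + response.statusCode());
        }
        // The raw bytes can then be handed to a model constructor,
        // e.g. an OpenNLP model taking an InputStream.
        return response.body();
    }
}
```

Re-running the fetch on a timer is enough to close the loop on the
tool side; the server side is just static file hosting.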

Back to the coref component: I had a look at extjwnl. One of the
issues I noticed with WordNet is that there are so many different
versions and formats for different languages, which makes it hard to
integrate them into coref (which should one day be able to support
other languages as well). I always thought we might need to define our
own WordNet data format, so we can easily handle WordNets for
different languages.
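As a sketch of what such a common format could look like (the fields
and names below are purely hypothetical, not a concrete proposal): one
synset per line, with language, POS, synset id and lemmas as
tab-separated fields:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical line-based WordNet format, one synset per line:
// "<lang>\t<pos>\t<synset-id>\t<lemma1,lemma2,...>"
// This only illustrates the idea of a single language-neutral format.
public final class SimpleWordNetReader {

    public record Synset(String lang, String pos, String id, List<String> lemmas) {}

    public static Synset parseLine(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 tab-separated fields: " + line);
        }
        return new Synset(fields[0], fields[1], fields[2],
                List.of(fields[3].split(",")));
    }

    // Build a lemma -> synsets index, which is the lookup coref needs.
    public static Map<String, List<Synset>> indexByLemma(List<String> lines) {
        Map<String, List<Synset>> index = new HashMap<>();
        for (String line : lines) {
            Synset s = parseLine(line);
            for (String lemma : s.lemmas()) {
                index.computeIfAbsent(lemma, k -> new ArrayList<>()).add(s);
            }
        }
        return index;
    }
}
```

A converter per upstream WordNet (Princeton, wordnets for other
languages) would then be the only format-specific code.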

I saw that you worked on this library; maybe that could be something
we can move to OpenNLP or base some new work on.

Another issue is that we have a zip package which contains all the
resources loaded into a component, but it looks like this is not so
easy with the current WordNet directory layout.
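For illustration, loading every resource out of one zip package needs
only the standard library; a minimal sketch (the entry names are
hypothetical):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Sketch: read all entries of a resource zip into memory, so a
// component can be constructed from a single package instead of
// a directory tree on disk.
public final class ZipResourceLoader {

    public static Map<String, byte[]> loadAll(InputStream packageStream)
            throws IOException {
        Map<String, byte[]> resources = new HashMap<>();
        try (ZipInputStream zip = new ZipInputStream(packageStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // readAllBytes stops at the end of the current entry.
                    resources.put(entry.getName(), zip.readAllBytes());
                }
            }
        }
        return resources;
    }
}
```

The WordNet files would just become entries like "wordnet/data.noun"
inside the package, so the directory layout problem moves into the
converter rather than the component.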

Jörn

Re: Coref problem

Posted by Aliaksandr Autayeu <al...@autayeu.com>.
On Thu, Nov 17, 2011 at 11:48 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 11/17/11 11:32 AM, Aliaksandr Autayeu wrote:
>
>> We shouldn't replace JWNL with a newer version,
>>> because we currently don't have the ability to train
>>> or evaluate the coref component.
>>>
>>>  +1. Having tests coverage eases many things, refactoring and development
>> included :)
>>
>> This is a big issue for us because that also blocks
>>
>>> other changes and updates to the code itself,
>>> e.g. the cleanups Aliaksandr contributed.
>>>
>>> What we need here is a plan how we can get the coref component
>>> into a state which makes it possible to develop it in a community.
>>>
>>> If we don't find a way to resolve this I think we should move the coref
>>> stuff
>>> to the sandbox and leave it there until we have some training data.
>>>
>>>  In my experience doing things like this is almost equal to deleting the
>> piece of code altogether. On the other side, if there is no developer,
>> actively using and developing this piece, having corpora, tests, etc,
>> others might not have enough incentives.
>>
>
> That is already the situation the developer who wrote doesn't support it
> anymore.
> The only way to get it alive again would be to get the training and
> evaluation running.
> If we have that, it will be possible to continue to work on it, and people
> can start using
> it. The code itself is easy to understand and I have a good idea of how it
> works.
>
> In the current state it really blocks the development of a few things.
>
>
>> Another option would be label enough wikinews data, so we are able to
>> train it.
>>
>> How much exactly is this "enough"? And what's the annotation UI? This also
>> might be a good option to improve the annotation tools. I might be
>> interested in pursuing this option (only if the corpus produced will be
>> under a free license), mainly to learn :) but I would need some help and
>> supervision.
>>
>
> We are discussing to do a wikinews crowd sourcing project to label
> training data for all components in OpenNLP.
>
> I once wrote a proposal to communicate this idea:
> https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
>
> Currently we have a first version of the Corpus Server and plugins
> for the UIMA Cas Editor (an annotation tool) to access articles in the
> Corpus Server and
> an OpenNLP Plugin which can help with doing sentence detection,
> tokenization and NER (could be extended with support for coref).
>
> These tools are all located in the sandbox.
>
> I am currently using them to run a private annotation project, and
> therefore have time to work on them.

I'll have a look at them. I also have my own annotation tools, because
I wasn't happy with what was available out there a few years ago, and
because of some specifics of the situation that can be exploited to
speed up the annotation, but I would be happy to avoid duplication.

Aliaksandr

Re: Coref problem

Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/17/11 11:32 AM, Aliaksandr Autayeu wrote:
>> We shouldn't replace JWNL with a newer version,
>> because we currently don't have the ability to train
>> or evaluate the coref component.
>>
> +1. Having tests coverage eases many things, refactoring and development
> included :)
>
> This is a big issue for us because that also blocks
>> other changes and updates to the code itself,
>> e.g. the cleanups Aliaksandr contributed.
>>
>> What we need here is a plan how we can get the coref component
>> into a state which makes it possible to develop it in a community.
>>
>> If we don't find a way to resolve this I think we should move the coref
>> stuff
>> to the sandbox and leave it there until we have some training data.
>>
> In my experience doing things like this is almost equal to deleting the
> piece of code altogether. On the other side, if there is no developer,
> actively using and developing this piece, having corpora, tests, etc,
> others might not have enough incentives.

That is already the situation: the developer who wrote it doesn't
support it anymore. The only way to get it alive again would be to get
the training and evaluation running. If we have that, it will be
possible to continue working on it, and people can start using it. The
code itself is easy to understand, and I have a good idea of how it
works.

In the current state it really blocks the development of a few things.

>
> Another option would be label enough wikinews data, so we are able to
> train it.
>
> How much exactly is this "enough"? And what's the annotation UI? This also
> might be a good option to improve the annotation tools. I might be
> interested in pursuing this option (only if the corpus produced will be
> under a free license), mainly to learn :) but I would need some help and
> supervision.

We are discussing doing a wikinews crowd-sourcing project to label
training data for all components in OpenNLP.

I once wrote a proposal to communicate this idea:
https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

Currently we have a first version of the Corpus Server, plugins for
the UIMA Cas Editor (an annotation tool) to access articles in the
Corpus Server, and an OpenNLP Plugin which can help with sentence
detection, tokenization and NER (and could be extended with support
for coref).

These tools are all located in the sandbox.

I am currently using them to run a private annotation project, and
therefore have time to work on them.

Jörn

Re: Coref problem (was: JWNL)

Posted by Aliaksandr Autayeu <al...@autayeu.com>.
> We shouldn't replace JWNL with a newer version,
> because we currently don't have the ability to train
> or evaluate the coref component.
>
+1. Having test coverage eases many things, refactoring and
development included :)

This is a big issue for us because that also blocks
> other changes and updates to the code itself,
> e.g. the cleanups Aliaksandr contributed.
>
> What we need here is a plan how we can get the coref component
> into a state which makes it possible to develop it in a community.
>
> If we don't find a way to resolve this I think we should move the coref
> stuff
> to the sandbox and leave it there until we have some training data.
>
In my experience, doing things like this is almost equal to deleting
the piece of code altogether. On the other hand, if there is no
developer actively using and developing this piece, having corpora,
tests, etc., others might not have enough incentive.

Don't having the ability to train coref also blocks changes we might want
> to do the our maxent library.
>
> Maybe it is possible to buy a license for MUC 6 and 7 data, so we can share
> this data privately by the team. Are any people familiar if that would be
> possible
> with the LDC license?
>
> The CONLL2011 data (OntoNotes, costs 50$) might also be suitable to train
> it:
> http://conll.bbn.com/index.php/data.html
>
> Another option would be label enough wikinews data, so we are able to
> train it.
>
How much exactly is this "enough"? And what's the annotation UI? This
also might be a good opportunity to improve the annotation tools. I
might be interested in pursuing this option (only if the corpus
produced will be under a free license), mainly to learn :) but I would
need some help and supervision.

Aliaksandr

Coref problem (was: JWNL)

Posted by Jörn Kottmann <ko...@gmail.com>.
We shouldn't replace JWNL with a newer version,
because we currently don't have the ability to train
or evaluate the coref component.

This is a big issue for us because that also blocks
other changes and updates to the code itself,
e.g. the cleanups Aliaksandr contributed.

What we need here is a plan for how we can get the coref component
into a state which makes it possible to develop it as a community.

If we don't find a way to resolve this, I think we should move the
coref stuff to the sandbox and leave it there until we have some
training data. Not having the ability to train coref also blocks
changes we might want to make to our maxent library.

Maybe it is possible to buy a license for the MUC 6 and 7 data, so we
can share this data privately within the team. Is anyone familiar with
the LDC license and whether that would be possible?

The CoNLL-2011 data (OntoNotes, costs $50) might also be suitable to
train it:
http://conll.bbn.com/index.php/data.html

Another option would be to label enough wikinews data so that we are
able to train it.

Jörn

On 11/17/11 5:50 AM, James Kosin wrote:
> All,
>
> I just saw this that may be interesting to update to...
>      http://sourceforge.net/projects/extjwnl/
>
> James


Re: JWNL

Posted by Chris Fournier <ch...@gmail.com>.
Not to be confused with eXtended WordNet <http://xwn.hlt.utdallas.edu/>
(extJWNL chose an awkward name).

On Wed, Nov 16, 2011 at 11:50 PM, James Kosin <ja...@gmail.com> wrote:

> All,
>
> I just saw this that may be interesting to update to...
>    http://sourceforge.net/projects/extjwnl/
>
> James
>