Posted to users@opennlp.apache.org by John Stewart <ca...@gmail.com> on 2012/07/11 20:38:42 UTC

Current state of coref

Hi,

On a different thread I saw that coref dependencies are creating a
small nuisance.  In my testing the module showed limited recall, so
I'm wondering whether it's in active development, and if not -- does
anyone use it?

Thanks,

jds

Re: Current state of coref

Posted by John Stewart <ca...@gmail.com>.
Has Penn said explicitly that they'll never release the Treebank under
an open license?

jds

On Thu, Jul 19, 2012 at 10:22 PM, James Kosin <ja...@gmail.com> wrote:
> On 7/19/2012 2:07 AM, Lance Norskog wrote:
>> What is the legitimacy of data which is tagged using an encumbered
>> model? I mean, if I tag documents with OpenNLP's non-free models on
>> sourceforge, the tagged output is a "derived work". Is this tagged
>> output considered free? Does this depend on the license of the
>> original data?
>>
>>
> Lance,
>
> The problem is two-fold.
>
> (1)  We would like to distribute the models on Apache.  Unfortunately,
> to do so would mean the models and the source used to create them
> would have to be under the Apache license.  We don't see any way
> around this other than to generate our own training data under an
> open license compatible with the Apache license.
>   Jörn is laying the groundwork for this with the tagging server,
> which will allow us to hand-tag and correct data for our own training
> sets.  I know it is redoing work that has already been done, but the
> benefits will be large in the long run.  Anyone could download the
> training data and add, remove, or otherwise customize the training set
> for various situations without worrying about copyright issues.
>   The downside is that we have a lot of work to do to get there.
>
> (2)  The models themselves, although available on SourceForge, are for
> research purposes ONLY.  The copyright and the contract with the
> holders of the copyright for the original works say so.  I've asked
> many people on this point.  We are not helping by breaking the law on
> this, nor do we suggest anyone do so.
>   The next problem is that we can't distribute the training data for
> the models, so it is next to impossible to modify the models to add
> training for other situations.  The data used for training come mainly
> from news sources, which limits their usefulness for some applications.
>
> .....
> I guess I'll have to get the FAQ section on our website done soon.
>
> Thanks,
> James

Re: Current state of coref

Posted by James Kosin <ja...@gmail.com>.
On 7/19/2012 2:07 AM, Lance Norskog wrote:
> What is the legitimacy of data which is tagged using an encumbered
> model? I mean, if I tag documents with OpenNLP's non-free models on
> sourceforge, the tagged output is a "derived work". Is this tagged
> output considered free? Does this depend on the license of the
> original data?
>
>
Lance,

The problem is two-fold.

(1)  We would like to distribute the models on Apache.  Unfortunately,
to do so would mean the models and the source used to create them
would have to be under the Apache license.  We don't see any way
around this other than to generate our own training data under an
open license compatible with the Apache license.
  Jörn is laying the groundwork for this with the tagging server,
which will allow us to hand-tag and correct data for our own training
sets.  I know it is redoing work that has already been done, but the
benefits will be large in the long run.  Anyone could download the
training data and add, remove, or otherwise customize the training set
for various situations without worrying about copyright issues.
  The downside is that we have a lot of work to do to get there.

(2)  The models themselves, although available on SourceForge, are for
research purposes ONLY.  The copyright and the contract with the
holders of the copyright for the original works say so.  I've asked
many people on this point.  We are not helping by breaking the law on
this, nor do we suggest anyone do so.
  The next problem is that we can't distribute the training data for
the models, so it is next to impossible to modify the models to add
training for other situations.  The data used for training come mainly
from news sources, which limits their usefulness for some applications.

.....
I guess I'll have to get the FAQ section on our website done soon.

Thanks,
James

Re: Current state of coref

Posted by Lance Norskog <go...@gmail.com>.
What is the legitimacy of data which is tagged using an encumbered
model? I mean, if I tag documents with OpenNLP's non-free models on
sourceforge, the tagged output is a "derived work". Is this tagged
output considered free? Does this depend on the license of the
original data?

On Wed, Jul 18, 2012 at 1:28 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 07/18/2012 04:30 AM, Lance Norskog wrote:
>>
>> Please use unencumbered training data for all future OpenNLP projects.
>
>
> We of course would like to do that, but it is not that easy.
> For coreference there is no good data set which is available
> under some kind of Open Source license.
>
> The only way to *fix* that is to produce your own training
> data based on a text source which can be shared under an
> OS license.
>
> We started working on tooling to crowd-source such annotations,
> but we still need to do a lot to finish it. So any help in this area is
> very welcome.
>
>
>> What exactly does a coref training dataset have to include? What kind
>> of tagging or cross-referencing?
>
>
> - Full or shallow parse
> - Named Entities
> - Linked mentions
>
> Have a look at this thread:
> http://mail-archives.apache.org/mod_mbox/opennlp-dev/201203.mbox/%3C4F7300F3.5050505@gmail.com%3E
>
> I proposed the new format there and then implemented it.
>
> For OntoNotes we need to do some adaptation to get it into something
> you can use for training, e.g. filtering verb mentions, doing the parsing,
> etc.
> If we get it trained nicely on this dataset it would be a good step forward.
>
> Jörn
>



-- 
Lance Norskog
goksron@gmail.com

Re: Current state of coref

Posted by Jörn Kottmann <ko...@gmail.com>.
On 07/18/2012 04:30 AM, Lance Norskog wrote:
> Please use unencumbered training data for all future OpenNLP projects.

We of course would like to do that, but it is not that easy.
For coreference there is no good data set which is available
under some kind of Open Source license.

The only way to *fix* that is to produce your own training
data based on a text source which can be shared under an
OS license.

We started working on tooling to crowd-source such annotations,
but we still need to do a lot to finish it. So any help in this area
is very welcome.

> What exactly does a coref training dataset have to include? What kind
> of tagging or cross-referencing?

- Full or shallow parse
- Named Entities
- Linked mentions
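To make those three requirements concrete, here is a hypothetical sketch of what one coref training document could carry. The field names and layout are illustrative only, not OpenNLP's actual training format:

```python
# Hypothetical sketch of the data a coref training example needs to carry.
# Field names are illustrative only -- not OpenNLP's actual format.

document = {
    # Full (or shallow) parse of each sentence, bracketed Treebank style here
    "parses": [
        "(S (NP (NNP John)) (VP (VBD bought) (NP (DT a) (NN car))))",
        "(S (NP (PRP He)) (VP (VBZ drives) (NP (PRP it)) (ADVP (RB daily))))",
    ],
    # Named entities: (sentence index, token span, type)
    "entities": [(0, (0, 1), "person")],
    # Linked mentions: each chain groups spans referring to the same entity
    "chains": [
        [(0, (0, 1)), (1, (0, 1))],   # "John" <- "He"
        [(0, (2, 4)), (1, (2, 3))],   # "a car" <- "it"
    ],
}

# Sanity check: every mention in a chain points at a real sentence
assert all(s < len(document["parses"])
           for chain in document["chains"] for (s, _) in chain)
```

The point of the sketch is that the three layers (parse, entities, chains) all index into the same sentences, which is why they have to be annotated on one shared corpus rather than assembled from separate sources.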

Have a look at this thread:
http://mail-archives.apache.org/mod_mbox/opennlp-dev/201203.mbox/%3C4F7300F3.5050505@gmail.com%3E

I proposed the new format there and then implemented it.

For OntoNotes we need to do some adaptation to get it into something
you can use for training, e.g. filtering verb mentions, doing the
parsing, etc.
If we get it trained nicely on this dataset it would be a good step forward.

Jörn


Re: Current state of coref

Posted by Lance Norskog <go...@gmail.com>.
Please use unencumbered training data for all future OpenNLP projects.

What exactly does a coref training dataset have to include? What kind
of tagging or cross-referencing?

On Tue, Jul 17, 2012 at 10:59 AM, John Stewart <ca...@gmail.com> wrote:
> Ah good, I was going to ask about parses too -- so this is done.  I'll
> start reading the code tonight.
>
> OntoNotes is smallish, yes?  Is the English bit larger than the CoNLL
> data set?  In terms of cost, isn't it free?
>
> Thanks,
>
> jds
>
> On Tue, Jul 17, 2012 at 11:09 AM, Jörn Kottmann <ko...@gmail.com> wrote:
>> On 07/17/2012 05:03 PM, John Stewart wrote:
>>>
>>> OK so per this https://issues.apache.org/jira/browse/OPENNLP-54
>>>
>>> you're saying that results may improve with the CONLL training set,
>>> yes?  That definitely seems worth trying to me.  Now, what, if any,
>>> policies are there about dependencies between OpenNLP modules?  I ask
>>> because the coref task might benefit from the NE output -- perhaps
>>> they are already linked!
>>
>>
>> The input for coref is this:
>> - Full or shallow parse (depends on how the model was trained)
>> - NER output
>>
>> All this information is encoded into Parse objects and therefore no
>> direct link between the components is necessary.
>> You can see this nicely when you run the command line demo.
>>
>> Yes, we need a corpus to train it on. Maybe OntoNotes would be a good
>> candidate; it's affordable for everyone.
>>
>> What do you think?
>>
>> Jörn
>>



-- 
Lance Norskog
goksron@gmail.com

Re: Current state of coref

Posted by John Stewart <ca...@gmail.com>.
Jörn,

In the new (and neat) Tool mechanism for 1.5, is there still a way to
send parsed (tree) input to the NER module?  Basically I'm trying to
put together the pipeline to the Coref Tool, but I'm not sure how
to hook it up to both parsed and NER-marked output.

Thanks,

jds

On Tue, Jul 17, 2012 at 1:59 PM, John Stewart <ca...@gmail.com> wrote:
> Ah good, I was going to ask about parses too -- so this is done.  I'll
> start reading the code tonight.
>
> OntoNotes is smallish, yes?  Is the English bit larger than the CoNLL
> data set?  In terms of cost, isn't it free?
>
> Thanks,
>
> jds
>
> On Tue, Jul 17, 2012 at 11:09 AM, Jörn Kottmann <ko...@gmail.com> wrote:
>> On 07/17/2012 05:03 PM, John Stewart wrote:
>>>
>>> OK so per this https://issues.apache.org/jira/browse/OPENNLP-54
>>>
>>> you're saying that results may improve with the CONLL training set,
>>> yes?  That definitely seems worth trying to me.  Now, what, if any,
>>> policies are there about dependencies between OpenNLP modules?  I ask
>>> because the coref task might benefit from the NE output -- perhaps
>>> they are already linked!
>>
>>
>> The input for coref is this:
>> - Full or shallow parse (depends on how the model was trained)
>> - NER output
>>
>> All this information is encoded into Parse objects and therefore no
>> direct link between the components is necessary.
>> You can see this nicely when you run the command line demo.
>>
>> Yes, we need a corpus to train it on. Maybe OntoNotes would be a good
>> candidate; it's affordable for everyone.
>>
>> What do you think?
>>
>> Jörn
>>

Re: Current state of coref

Posted by John Stewart <ca...@gmail.com>.
Ah good, I was going to ask about parses too -- so this is done.  I'll
start reading the code tonight.

OntoNotes is smallish, yes?  Is the English bit larger than the CoNLL
data set?  In terms of cost, isn't it free?

Thanks,

jds

On Tue, Jul 17, 2012 at 11:09 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 07/17/2012 05:03 PM, John Stewart wrote:
>>
>> OK so per this https://issues.apache.org/jira/browse/OPENNLP-54
>>
>> you're saying that results may improve with the CONLL training set,
>> yes?  That definitely seems worth trying to me.  Now, what, if any,
>> policies are there about dependencies between OpenNLP modules?  I ask
>> because the coref task might benefit from the NE output -- perhaps
>> they are already linked!
>
>
> The input for coref is this:
> - Full or shallow parse (depends on how the model was trained)
> - NER output
>
> All this information is encoded into Parse objects and therefore no
> direct link between the components is necessary.
> You can see this nicely when you run the command line demo.
>
> Yes, we need a corpus to train it on. Maybe OntoNotes would be a good
> candidate; it's affordable for everyone.
>
> What do you think?
>
> Jörn
>

Re: Current state of coref

Posted by Jörn Kottmann <ko...@gmail.com>.
On 07/17/2012 05:03 PM, John Stewart wrote:
> OK so per this https://issues.apache.org/jira/browse/OPENNLP-54
> you're saying that results may improve with the CONLL training set,
> yes?  That definitely seems worth trying to me.  Now, what, if any,
> policies are there about dependencies between OpenNLP modules?  I ask
> because the coref task might benefit from the NE output -- perhaps
> they are already linked!

The input for coref is this:
- Full or shallow parse (depends on how the model was trained)
- NER output

All this information is encoded into Parse objects and therefore no
direct link between the components is necessary.
You can see this nicely when you run the command line demo.
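That decoupling can be sketched roughly as follows (illustrative Python, not OpenNLP's API): the parser builds a tree, the NER step writes entity labels onto the same tree, and coref reads only the annotated tree, so no component calls another directly.

```python
# Conceptual sketch of components communicating through one shared parse
# structure. All names here are illustrative, not OpenNLP's actual classes.

class ParseNode:
    def __init__(self, label, children=None, token=None):
        self.label = label            # constituent label, e.g. "NP"
        self.children = children or []
        self.token = token            # surface token for leaves
        self.entity_type = None       # filled in by the NER step

def leaves(node):
    """Collect leaf nodes left to right."""
    if node.token is not None:
        return [node]
    return [leaf for c in node.children for leaf in leaves(c)]

def tag_entities(root, spans):
    """NER step: mark leaf tokens with an entity type on the shared tree."""
    toks = leaves(root)
    for (start, end, etype) in spans:
        for leaf in toks[start:end]:
            leaf.entity_type = etype

def mentions(root):
    """Coref step: read candidate mentions straight off the annotated tree."""
    return [(n.token, n.entity_type) for n in leaves(root) if n.entity_type]

tree = ParseNode("S", [
    ParseNode("NP", [ParseNode("NNP", token="John")]),
    ParseNode("VP", [ParseNode("VBD", token="slept")]),
])
tag_entities(tree, [(0, 1, "person")])
print(mentions(tree))  # -> [('John', 'person')]
```

Because coref only consumes the annotated tree, swapping in a different parser or name finder requires no change to the coref component itself, which is the design Jörn describes.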

Yes, we need a corpus to train it on. Maybe OntoNotes would be a good
candidate; it's affordable for everyone.

What do you think?

Jörn


Re: Current state of coref

Posted by John Stewart <ca...@gmail.com>.
OK so per this https://issues.apache.org/jira/browse/OPENNLP-54
you're saying that results may improve with the CONLL training set,
yes?  That definitely seems worth trying to me.  Now, what, if any,
policies are there about dependencies between OpenNLP modules?  I ask
because the coref task might benefit from the NE output -- perhaps
they are already linked!

jds

On Tue, Jul 17, 2012 at 8:04 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 07/17/2012 01:55 PM, John Stewart wrote:
>>
>> Well, my sense is that before much more work on packaging is done,
>> the quality of the output needs to improve.  I'm not sure it's
>> just a matter of training -- but at this point I'm not at all sure of
>> what I'm saying.  My *impression* is that the module needs to
>> incorporate a bit more knowledge of language in order to increase
>> recall without over-generating.  Does that make sense?  Also, is there
>> any documentation on how it works currently?  I would be interested in
>> helping, time permitting as always.
>
>
> We do not have documentation.  There are some posts on our
> mailing list about it, and there is a thesis by Thomas Morton
> which has a chapter on the coref component.
>
> I would like to at least provide very basic documentation for
> the next release.
>
> Do you want to propose some changes or do you have ideas what
> we can do to improve the quality of the output?
>
> The coref component was implemented by Tom, and we have only
> maintained it lightly here, so we do not have deep knowledge of it.
> That is something that should change; I did read and work on
> the code while looking into how to add training support for it.
>
> Do you think OntoNotes is a good data set to continue the development?
>
> Jörn
>

Re: Current state of coref

Posted by Jörn Kottmann <ko...@gmail.com>.
On 07/17/2012 01:55 PM, John Stewart wrote:
> Well, my sense is that before much more work on packaging is done,
> the quality of the output needs to improve.  I'm not sure it's
> just a matter of training -- but at this point I'm not at all sure of
> what I'm saying.  My *impression* is that the module needs to
> incorporate a bit more knowledge of language in order to increase
> recall without over-generating.  Does that make sense?  Also, is there
> any documentation on how it works currently?  I would be interested in
> helping, time permitting as always.

We do not have documentation.  There are some posts on our
mailing list about it, and there is a thesis by Thomas Morton
which has a chapter on the coref component.

I would like to at least provide very basic documentation for
the next release.

Do you want to propose some changes or do you have ideas what
we can do to improve the quality of the output?

The coref component was implemented by Tom, and we have only
maintained it lightly here, so we do not have deep knowledge of it.
That is something that should change; I did read and work on
the code while looking into how to add training support for it.

Do you think OntoNotes is a good data set to continue the development?

Jörn


Re: Current state of coref

Posted by John Stewart <ca...@gmail.com>.
Well, my sense is that before much more work on packaging is done,
the quality of the output needs to improve.  I'm not sure it's
just a matter of training -- but at this point I'm not at all sure of
what I'm saying.  My *impression* is that the module needs to
incorporate a bit more knowledge of language in order to increase
recall without over-generating.  Does that make sense?  Also, is there
any documentation on how it works currently?  I would be interested in
helping, time permitting as always.

Thanks,

jds

On Thu, Jul 12, 2012 at 6:11 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 07/11/2012 08:38 PM, John Stewart wrote:
>>
>> On a different thread I saw that coref dependencies are creating a
>> small nuisance.  In my testing the module showed limited recall, so
>> I'm wondering whether it's in active development, and if not -- does
>> anyone use it?
>
>
> I am planning to use it soon, and have spent quite some time working
> on the training, but that is still a work in progress.
> You are welcome to help us with that.  There are quite a few things
> that still need to be done to get it into the same state as the other
> components, e.g. improve the documentation, fix the training support,
> use a model package, ease the usage of JWNL, implement
> evaluation, etc.
>
> Jörn
>

Re: Current state of coref

Posted by Jörn Kottmann <ko...@gmail.com>.
On 07/11/2012 08:38 PM, John Stewart wrote:
> On a different thread I saw that coref dependencies are creating a
> small nuisance.  In my testing the module showed limited recall, so
> I'm wondering whether it's in active development, and if not -- does
> anyone use it?

I am planning to use it soon, and have spent quite some time working
on the training, but that is still a work in progress.
You are welcome to help us with that.  There are quite a few things
that still need to be done to get it into the same state as the other
components, e.g. improve the documentation, fix the training support,
use a model package, ease the usage of JWNL, implement
evaluation, etc.

Jörn