Posted to dev@opennlp.apache.org by "william.colen@gmail.com" <wi...@gmail.com> on 2011/08/02 19:42:51 UTC

Chunker and the head of phrase

Hi,

For the application I am developing, it is important to know the head of a
chunk.

I added a * to the chunk tag to mark tokens that are the head of the phrase.
For example, I have:

Me pron-pers *B-NP
pergunto v-fin B-VP
sempre adv *B-ADVP
quem pron-indp *B-NP
podia v-fin B-VP
ter v-inf I-VP
sido v-pcp I-VP
aquele pron-det B-NP
jovem adj I-NP
alemão n *I-NP
. . O

It is working OK, and the F1 score is almost the same as without the head
mark.
But I have some issues. With this mark, the method Chunker.chunkAsSpans() and
the UIMA Chunker don't work properly, because the current implementation
doesn't know how to handle the * while computing the spans.
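
One way to keep span computation working is to separate the * from the tag
before the spans are computed. The following is only a minimal sketch of that
idea in plain Java. The class HeadMarkStripper and its methods are
hypothetical names used for illustration; they are not part of OpenNLP.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of OpenNLP: separates the '*' head mark from chunk tags. */
final class HeadMarkStripper {

    /** Returns the tags with a leading '*' removed, e.g. "*B-NP" becomes "B-NP". */
    static String[] plainTags(String[] markedTags) {
        String[] plain = new String[markedTags.length];
        for (int i = 0; i < markedTags.length; i++) {
            String tag = markedTags[i];
            plain[i] = tag.startsWith("*") ? tag.substring(1) : tag;
        }
        return plain;
    }

    /** Returns true at every position whose tag carried the '*' head mark. */
    static boolean[] headFlags(String[] markedTags) {
        boolean[] heads = new boolean[markedTags.length];
        for (int i = 0; i < markedTags.length; i++) {
            heads[i] = markedTags[i].startsWith("*");
        }
        return heads;
    }

    public static void main(String[] args) {
        // Chunk tags from the example sentence above.
        String[] marked = {"*B-NP", "B-VP", "*B-ADVP", "*B-NP", "B-VP",
                           "I-VP", "I-VP", "B-NP", "I-NP", "*I-NP", "O"};
        System.out.println(Arrays.toString(plainTags(marked)));
        System.out.println(Arrays.toString(headFlags(marked)));
    }
}
```

The plain tags can then be handed to code that expects ordinary B-/I-/O tags,
while the head flags are kept alongside them.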

I would like to ask you whether adding this to OpenNLP is a good idea. If so, I
would change the trunk code to handle this head symbol. Otherwise, maybe you
could give me some advice on how to do it without changing the current
implementation.

Thanks,
William

Re: Chunker and the head of phrase

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

Hi!

Thank you for your advice. I implemented some rules in the application and
it looks fine. There is no need to complicate the chunker.

William

On Fri, Aug 5, 2011 at 11:21 AM, Jason Baldridge
<ja...@gmail.com> wrote:

> It's not clear to me that head identification should be done as part of the
> prediction unless it improves performance across a couple of languages. With
> things as small as chunks, I'm guessing regular expressions, or a secondary
> head prediction model would do the trick. Any reasons to complicate the
> chunker itself?
>
> Jason
>
> On Wed, Aug 3, 2011 at 8:38 AM, Jörn Kottmann <ko...@gmail.com> wrote:
>
> > On 8/3/11 3:24 PM, william.colen@gmail.com wrote:
> >
> >> It would be available for other languages. Just need to add the mark to
> >> the corpus tags. I think it is much better to use the Chunker because it
> >> is faster and adding the head (some people call it main).
> >>
> >
> > I see, and that depends on training data which labels the head. Do you
> > know of any for other languages?
> >
> > Maybe we should have a dedicated head finder as part of the parser, which
> > could also run stand-alone.
> >
> > Would be nice to know what Jason thinks.
> >
> > In the coref component we have several models which could also be
> > interesting for some people to use without the other coref stuff, for
> > example the model to label the gender of an entity.
> >
> > Jörn
> >
>
>
>
> --
> Jason Baldridge
> Assistant Professor, Department of Linguistics
> The University of Texas at Austin
> http://www.jasonbaldridge.com
> http://twitter.com/jasonbaldridge
>

Re: Chunker and the head of phrase

Posted by Jason Baldridge <ja...@gmail.com>.

It's not clear to me that head identification should be done as part of the
prediction unless it improves performance across a couple of languages. With
things as small as chunks, I'm guessing regular expressions, or a secondary
head prediction model would do the trick. Any reasons to complicate the
chunker itself?
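
To make the regular-expression idea concrete, here is a minimal sketch of a
rule-based head finder that runs after the chunker. Everything in it is an
assumption for illustration: the class name, the rule table, and the POS
patterns (chosen only so that they reproduce the heads marked in the example
sentence from the first message). No such class exists in OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical post-processing step: picks a head per chunk from POS tags. */
final class RuleBasedHeadFinder {

    // Assumed rules: chunk type -> pattern over POS tags allowed to head that chunk.
    private static final Map<String, Pattern> HEAD_POS = Map.of(
            "NP", Pattern.compile("n|prop|pron-.*"),
            "ADVP", Pattern.compile("adv"));

    /** Returns the index of the head token for each chunk that has a matching rule. */
    static List<Integer> findHeads(String[] posTags, String[] chunkTags) {
        List<Integer> heads = new ArrayList<>();
        int start = 0;
        while (start < chunkTags.length) {
            if (!chunkTags[start].startsWith("B-")) {
                start++; // outside any chunk
                continue;
            }
            String type = chunkTags[start].substring(2);
            int end = start + 1;
            while (end < chunkTags.length && chunkTags[end].equals("I-" + type)) {
                end++;
            }
            Pattern rule = HEAD_POS.get(type);
            if (rule != null) {
                // Rightmost token in the chunk whose POS tag matches the rule.
                for (int i = end - 1; i >= start; i--) {
                    if (rule.matcher(posTags[i]).matches()) {
                        heads.add(i);
                        break;
                    }
                }
            }
            start = end;
        }
        return heads;
    }

    public static void main(String[] args) {
        String[] pos = {"pron-pers", "v-fin", "adv", "pron-indp", "v-fin",
                        "v-inf", "v-pcp", "pron-det", "adj", "n", "."};
        String[] chunks = {"B-NP", "B-VP", "B-ADVP", "B-NP", "B-VP",
                           "I-VP", "I-VP", "B-NP", "I-NP", "I-NP", "O"};
        System.out.println(findHeads(pos, chunks));
    }
}
```

Chunk types without a rule (VP here) simply get no head, which mirrors the
unmarked VP chunks in the example data.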

Jason

On Wed, Aug 3, 2011 at 8:38 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 8/3/11 3:24 PM, william.colen@gmail.com wrote:
>
>> It would be available for other languages. Just need to add the mark to
>> the corpus tags. I think it is much better to use the Chunker because it is
>> faster and adding the head (some people call it main).
>>
>
> I see, and that depends on training data which labels the head. Do you know
> of any for other languages?
>
> Maybe we should have a dedicated head finder as part of the parser, which
> could also run stand-alone.
>
> Would be nice to know what Jason thinks.
>
> In the coref component we have several models which could also be
> interesting for some people to use without the other coref stuff, for
> example the model to label the gender of an entity.
>
> Jörn
>



-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Re: Chunker and the head of phrase

Posted by Jörn Kottmann <ko...@gmail.com>.

On 8/3/11 3:24 PM, william.colen@gmail.com wrote:
> It would be available for other languages. Just need to add the mark to the
> corpus tags. I think it is much better to use the Chunker because it is
> faster and adding the head (some people call it main).

I see, and that depends on training data which labels the head. Do you know
of any for other languages?

Maybe we should have a dedicated head finder as part of the parser, which
could also run stand-alone.

Would be nice to know what Jason thinks.

In the coref component we have several models which could also be
interesting for some people to use without the other coref stuff, for
example the model to label the gender of an entity.

Jörn

Re: Chunker and the head of phrase

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

On Wed, Aug 3, 2011 at 4:28 AM, Jörn Kottmann <ko...@gmail.com> wrote:

>> I would like to ask you if adding it to OpenNLP is a good idea. If yes I
>> would change the trunk code to handle this head symbol, or maybe you should
>> give me some advice on how to do that without the need of changing the
>> current implementation.
>>
>
> Would it be available for other languages also?
> Maybe most people who need this might just use the parser.
>

It would be available for other languages. You just need to add the mark to
the corpus tags. I think it is much better to use the Chunker, because it is
faster, and add the head (some people call it main) there.

> Anyway it should be easy to extend the ChunkerME class in a way that
> modifying the labels as you did is possible without modifying the OpenNLP
> code.
>

Yes, it is easy. I just changed the chunkAsSpans method and added a new
class called HeadedSpan. It is easy to do it as an extension, but then I will
no longer be able to use the command line tools to train, execute and
evaluate. Still, I think it is the way to go if head marking is not common to
other languages.
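
The code for that extension is not shown in this thread. As a rough sketch of
what a head-aware span computation could look like, the plain-Java example
below builds spans directly from the *-marked tags. The fields of HeadedSpan
and the HeadedSpanBuilder class are assumptions made for illustration, and the
handling of malformed tag sequences is a simplification that may differ from
what OpenNLP does.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for a HeadedSpan class; the real fields are not shown in the thread. */
final class HeadedSpan {
    final int start;   // index of the first token (inclusive)
    final int end;     // index after the last token (exclusive)
    final String type; // chunk type, e.g. "NP"
    final int head;    // index of the head token, or -1 if none was marked

    HeadedSpan(int start, int end, String type, int head) {
        this.start = start;
        this.end = end;
        this.type = type;
        this.head = head;
    }

    @Override
    public String toString() {
        return type + "[" + start + ".." + end + ") head=" + head;
    }
}

/** Hypothetical builder: turns '*'-marked B-/I-/O tags into headed spans. */
final class HeadedSpanBuilder {

    static List<HeadedSpan> toSpans(String[] markedTags) {
        List<HeadedSpan> spans = new ArrayList<>();
        int start = -1;
        int head = -1;
        String type = null; // type of the currently open chunk, or null
        for (int i = 0; i < markedTags.length; i++) {
            boolean isHead = markedTags[i].startsWith("*");
            String tag = isHead ? markedTags[i].substring(1) : markedTags[i];
            boolean continues = type != null && tag.equals("I-" + type);
            if (!continues && type != null) {
                // The open chunk ends just before this token.
                spans.add(new HeadedSpan(start, i, type, head));
                type = null;
            }
            if (tag.startsWith("B-")) {
                start = i;
                head = -1;
                type = tag.substring(2);
            }
            // An I- tag with no matching open chunk is treated as outside (simplification).
            if (type != null && isHead) {
                head = i; // if several tokens are marked, the last one wins
            }
        }
        if (type != null) {
            spans.add(new HeadedSpan(start, markedTags.length, type, head));
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] marked = {"*B-NP", "B-VP", "*B-ADVP", "*B-NP", "B-VP",
                           "I-VP", "I-VP", "B-NP", "I-NP", "*I-NP", "O"};
        for (HeadedSpan span : toSpans(marked)) {
            System.out.println(span);
        }
    }
}
```

Keeping this logic in a separate class leaves the chunker itself untouched,
at the cost described above: the standard command line tools would not know
about the head mark.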

Thanks
William

Re: Chunker and the head of phrase

Posted by Jörn Kottmann <ko...@gmail.com>.

On 8/2/11 7:42 PM, william.colen@gmail.com wrote:
> Hi,
>
> For the application I am developing, it is important to know the head of a
> chunk.
>
> I added a * to the chunk tag to mark tokens that are the head of the phrase.
> For example I have:
>
> Me pron-pers *B-NP
> pergunto v-fin B-VP
> sempre adv *B-ADVP
> quem pron-indp *B-NP
> podia v-fin B-VP
> ter v-inf I-VP
> sido v-pcp I-VP
> aquele pron-det B-NP
> jovem adj I-NP
> alemão n *I-NP
> . . O
>
> It is working OK and the F-1 is almost the same as if there was no head
> mark.
> But I have some issues. With this mark the method Chunker.chunkAsSpans() and
> the UIMA Chunker don't work properly because the current implementation
> doesn't know how to handle the * while computing the spans.
>
> I would like to ask you if adding it to OpenNLP is a good idea. If yes I
> would change the trunk code to handle this head symbol, or maybe you should
> give me some advice on how to do that without the need of changing the
> current implementation.
>

Would it be available for other languages also?
Maybe most people who need this might just use the parser.

Anyway, it should be easy to extend the ChunkerME class so that modifying
the labels as you did is possible without modifying the OpenNLP code.

Jörn