Posted to dev@opennlp.apache.org by Jörn Kottmann <ko...@gmail.com> on 2014/02/19 11:01:41 UTC

Sequence coding

Hi all,

the chunker and name finder both use IOB2 sequence coding. The logic
to do that is hard coded in both components.

I would like to suggest that we introduce a SequenceCodec interface to
abstract this code. This would allow us to reuse the same coding logic in
both components and to replace IOB2 with other sequence codecs such as BILOU.

On my NER test datasets the F-Measure went up or down by around 1%,
depending on the machine learner and data set, when using BILOU coding
instead of IOB2 coding.

I didn't do any testing in the chunker.

Any opinions? Is it worth the effort?

Jörn

Re: Sequence coding

Posted by Jörn Kottmann <ko...@gmail.com>.
On 02/19/2014 01:25 PM, William Colen wrote:
> Is the SequenceValidator the only thing we need to change? If a corpus uses
> BILOU, the formatters need to convert it to IOB2?

The format parsing code creates Span objects. The name finder and
chunker take these Span objects and then perform IOB2 coding on them
(start, cont, other).

The coding is done in two places: during training the Span objects are
encoded into tag sequences, and during tagging the tag sequences are
decoded into Span objects again.
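
To make the two steps concrete, here is a minimal, self-contained sketch of IOB2 encoding and decoding. It is not the code in the name finder or chunker. The `Span` class is a hypothetical stand-in for the OpenNLP type, and the type-prefixed tag strings such as "person-start" are an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Iob2Sketch {

    /** Hypothetical stand-in for a span: token range [start, end) plus a type. */
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + ".." + end + ")";
        }
    }

    static final String START = "start";
    static final String CONT = "cont";
    static final String OTHER = "other";

    /** Training side: turn spans into one tag per token. */
    static String[] encode(Span[] names, int length) {
        String[] tags = new String[length];
        Arrays.fill(tags, OTHER);
        for (Span name : names) {
            tags[name.start] = name.type + "-" + START;
            for (int i = name.start + 1; i < name.end; i++) {
                tags[i] = name.type + "-" + CONT;
            }
        }
        return tags;
    }

    /** Tagging side: turn a tag sequence back into spans. */
    static Span[] decode(List<String> tags) {
        List<Span> spans = new ArrayList<>();
        int start = -1; // index of the first token of the open span, or -1
        String type = null;
        for (int i = 0; i < tags.size(); i++) {
            String tag = tags.get(i);
            if (tag.endsWith("-" + CONT)) {
                continue; // still inside the current span (a stray cont is ignored)
            }
            // A start or other tag closes any span that is still open.
            if (start != -1) {
                spans.add(new Span(start, i, type));
                start = -1;
            }
            if (tag.endsWith("-" + START)) {
                start = i;
                type = tag.substring(0, tag.length() - START.length() - 1);
            }
        }
        if (start != -1) {
            spans.add(new Span(start, tags.size(), type)); // span runs to the end
        }
        return spans.toArray(new Span[0]);
    }

    public static void main(String[] args) {
        Span[] names = { new Span(0, 2, "person"), new Span(4, 5, "location") };
        String[] tags = encode(names, 6);
        // prints [person-start, person-cont, other, other, location-start, other]
        System.out.println(Arrays.toString(tags));
        // prints [person[0..2), location[4..5)]
        System.out.println(Arrays.toString(decode(Arrays.asList(tags))));
    }
}
```

Decoding the output of `encode` gives back the original spans, which is the round trip a codec has to guarantee.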

An interface like this could work for the name finder (I haven't checked
the chunker yet):

public interface SequenceCodec {
   Span[] decode(List<String> c);
   String[] encode(Span[] names, int length);
   SequenceValidator createSequenceValidator();
}
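
As an illustration of how another coding could plug into such an interface, below is a rough sketch of a BILOU codec. Everything in it is an assumption for the sake of the example: `Span` is a hypothetical stand-in, the interface is trimmed to `encode` and `decode`, and the tag suffixes "-start", "-cont", "-last" and "-unit" are made-up names for B, I, L and U. Spans are assumed to be non-empty and non-overlapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BilouSketch {

    /** Hypothetical stand-in for a span: token range [start, end) plus a type. */
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Trimmed copy of the proposed interface (validator factory left out). */
    interface SequenceCodec {
        Span[] decode(List<String> c);
        String[] encode(Span[] names, int length);
    }

    static final class BilouCodec implements SequenceCodec {

        @Override
        public String[] encode(Span[] names, int length) {
            String[] tags = new String[length];
            Arrays.fill(tags, "other");
            for (Span name : names) {
                if (name.end - name.start == 1) {
                    tags[name.start] = name.type + "-unit"; // single-token span
                } else {
                    tags[name.start] = name.type + "-start";
                    for (int i = name.start + 1; i < name.end - 1; i++) {
                        tags[i] = name.type + "-cont";
                    }
                    tags[name.end - 1] = name.type + "-last";
                }
            }
            return tags;
        }

        @Override
        public Span[] decode(List<String> c) {
            List<Span> spans = new ArrayList<>();
            int start = -1; // first token of the open span, or -1
            for (int i = 0; i < c.size(); i++) {
                String tag = c.get(i);
                if (tag.endsWith("-unit")) {
                    spans.add(new Span(i, i + 1, typeOf(tag)));
                    start = -1;
                } else if (tag.endsWith("-start")) {
                    start = i;
                } else if (tag.endsWith("-last")) {
                    if (start != -1) {
                        spans.add(new Span(start, i + 1, typeOf(tag)));
                    }
                    start = -1;
                } else if (!tag.endsWith("-cont")) {
                    start = -1; // "other": drop any span that was never closed
                }
            }
            return spans.toArray(new Span[0]);
        }

        private static String typeOf(String tag) {
            return tag.substring(0, tag.lastIndexOf('-'));
        }
    }

    public static void main(String[] args) {
        SequenceCodec codec = new BilouCodec();
        Span[] names = { new Span(0, 3, "person"), new Span(4, 5, "location") };
        System.out.println(Arrays.toString(codec.encode(names, 6)));
    }
}
```

Compared with IOB2, the only visible difference for callers is the tag set; the Span objects going in and coming out stay the same.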

The SequenceValidator depends, of course, on the codec in use and could
be created by a factory method.

Some machine learners, e.g. Mallet CRF, don't support our sequence
validation. I am not yet sure how we should handle that case.
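
The following sketch shows why the validator belongs to the codec: the same transition can be valid under IOB2 and invalid under BILOU. The `SequenceValidator` interface here is a simplified, hypothetical stand-in (not the OpenNLP signature), the codec interface is trimmed to the factory method, and the tag suffixes are assumptions.

```java
import java.util.Objects;

public class ValidatorSketch {

    /** Simplified, hypothetical stand-in for a sequence validator. */
    interface SequenceValidator {
        /** Returns true if outcome may be placed at position i after the prior outcomes. */
        boolean validSequence(int i, String[] priorOutcomes, String outcome);
    }

    /** Trimmed copy of the proposed interface: only the factory method is shown. */
    interface SequenceCodec {
        SequenceValidator createSequenceValidator();
    }

    /** "person-start" -> "person"; tags without a dash (e.g. "other") have no type. */
    static String typeOf(String tag) {
        int dash = tag.lastIndexOf('-');
        return dash < 0 ? null : tag.substring(0, dash);
    }

    /** True if the tag leaves a span open, so that a continuation may follow. */
    static boolean opensSpan(String tag) {
        return tag.endsWith("-start") || tag.endsWith("-cont");
    }

    static final class Iob2Codec implements SequenceCodec {
        @Override
        public SequenceValidator createSequenceValidator() {
            return (i, prior, outcome) -> {
                if (!outcome.endsWith("-cont")) {
                    return true; // start and other may follow anything in IOB2
                }
                // cont must continue an open span of the same type
                return i > 0 && opensSpan(prior[i - 1])
                        && Objects.equals(typeOf(prior[i - 1]), typeOf(outcome));
            };
        }
    }

    static final class BilouCodec implements SequenceCodec {
        @Override
        public SequenceValidator createSequenceValidator() {
            return (i, prior, outcome) -> {
                boolean open = i > 0 && opensSpan(prior[i - 1]);
                boolean continues = outcome.endsWith("-cont") || outcome.endsWith("-last");
                if (continues) {
                    // cont and last must continue an open span of the same type
                    return open && Objects.equals(typeOf(prior[i - 1]), typeOf(outcome));
                }
                // start, unit and other are only allowed when no span is open
                return !open;
            };
        }
    }

    public static void main(String[] args) {
        String[] prior = { "person-start" };
        SequenceValidator iob2 = new Iob2Codec().createSequenceValidator();
        SequenceValidator bilou = new BilouCodec().createSequenceValidator();
        // "other" directly after "person-start": fine in IOB2, not in BILOU
        System.out.println(iob2.validSequence(1, prior, "other"));
        System.out.println(bilou.validSequence(1, prior, "other"));
    }
}
```

A step-by-step check like this cannot reject a BILOU sequence that ends with a span still open, so a real implementation would need to handle that case separately.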

Jörn


Re: Sequence coding

Posted by William Colen <wi...@gmail.com>.
Is the SequenceValidator the only thing we need to change? If a corpus uses
BILOU, the formatters need to convert it to IOB2?
