Posted to dev@opennlp.apache.org by "william.colen@gmail.com" <wi...@gmail.com> on 2011/06/15 21:07:38 UTC

Abbreviation in SentenceDetector

Hi,

I have a few questions about abbreviations in the sentence detector. I'd like to
understand how it works and improve it if possible.

1) How is the sentence detector using the abbreviation dictionary? All train
methods in SentenceDetectorME take an abbreviation dictionary as an argument,
but only save it to the model. The dictionary is not used to create
the context generator, but it should be, shouldn't it?

2) The command line trainer does not allow passing an abbreviation
dictionary. Maybe it should accept the name of a file that contains the
dictionary.

3) Maybe we should include tools to extract the abbreviation dictionary from
the training corpus. Optionally this could be executed during training too.


What do you think?

Re: Abbreviation in SentenceDetector

Posted by Jörn Kottmann <ko...@gmail.com>.

On 7/6/11 3:44 PM, william.colen@gmail.com wrote:
> Sorry for the late answer...
>
> On Tue, Jun 28, 2011 at 4:43 PM, Jörn Kottmann<ko...@gmail.com>  wrote:
>
>> On 6/15/11 9:07 PM, william.colen@gmail.com wrote:
>>
>>> 1) How is the sentence detector using the abbreviation dictionary? All
>>> train methods in SentenceDetectorME take an abbreviation dictionary as
>>> an argument, but only save it to the model. The dictionary is not used
>>> to create the context generator, but it should be, shouldn't it?
>>>
>> I am not sure how the dictionary is used, or what the intent was.
>> Do we have features in the sentence detectors which are based on
>> a dictionary?
>>
> Yes, we have. The constructor of DefaultSDContextGenerator takes a
> Set<String> inducedAbbreviations as an argument and uses it to populate the
> contextual features. This constructor is not used anywhere inside the
> project.

+1 to fix this, and add proper support for it again.

> BTW, shouldn't we have something similar in the Tokenizer? I noticed that a
> lot of the Tokenizer's false positives were caused by abbreviations. My feeling
> is that there are so many cases where the token should be separated from the
> dot that it will always split it.

+1 to add dictionary support to the tokenizer also.

>> Let's get the dictionary support into a good state again.
>>
> I'll start working on that soon.
>

Nice, please open Jira issues for the two changes.

Jörn

Re: Abbreviation in SentenceDetector

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

Sorry for the late answer...

On Tue, Jun 28, 2011 at 4:43 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 6/15/11 9:07 PM, william.colen@gmail.com wrote:
>
>> 1) How is the sentence detector using the abbreviation dictionary? All
>> train methods in SentenceDetectorME take an abbreviation dictionary as
>> an argument, but only save it to the model. The dictionary is not used
>> to create the context generator, but it should be, shouldn't it?
>>
>
> I am not sure how the dictionary is used, or what the intent was.
> Do we have features in the sentence detectors which are based on
> a dictionary?
>

Yes, we have. The constructor of DefaultSDContextGenerator takes a
Set<String> inducedAbbreviations as an argument and uses it to populate the
contextual features. This constructor is not used anywhere inside the
project.
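
To make the idea concrete, here is a minimal sketch of how a set of
abbreviations could feed a contextual feature for a candidate sentence end.
It is not the DefaultSDContextGenerator code: the class AbbrevContextSketch,
the method features, and the feature strings are all hypothetical names made
up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only; this is NOT OpenNLP's DefaultSDContextGenerator.
class AbbrevContextSketch {

    // Builds features for the candidate end-of-sentence character at index pos.
    static List<String> features(String text, int pos, Set<String> abbreviations) {
        List<String> feats = new ArrayList<>();

        // Walk left to the previous whitespace to find the text before the candidate.
        int start = pos;
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
        }
        String prefix = text.substring(start, pos);
        feats.add("prefix=" + prefix);

        // Dictionary feature: prefix plus the candidate character is a known abbreviation.
        if (abbreviations.contains(prefix + text.charAt(pos))) {
            feats.add("abbrev");
        }

        // Is the first non-whitespace character after the candidate upper case?
        int next = pos + 1;
        while (next < text.length() && Character.isWhitespace(text.charAt(next))) {
            next++;
        }
        if (next < text.length() && Character.isUpperCase(text.charAt(next))) {
            feats.add("nextcap");
        }
        return feats;
    }

    public static void main(String[] args) {
        Set<String> abbreviations = new HashSet<>(Arrays.asList("Dr.", "Mr."));
        String text = "Dr. Smith arrived. He left.";
        System.out.println(features(text, 2, abbreviations));   // the dot in "Dr."
        System.out.println(features(text, 17, abbreviations));  // the dot after "arrived"
    }
}
```

With a feature like "abbrev" available, the model can learn that a dot
followed by a capitalized word is not always a sentence boundary.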

BTW, shouldn't we have something similar in the Tokenizer? I noticed that a
lot of the Tokenizer's false positives were caused by abbreviations. My feeling
is that there are so many cases where the token should be separated from the
dot that it will always split it.
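
A rough sketch of what dictionary support could mean for tokenization,
assuming a plain whitespace split followed by a dictionary check on the
trailing dot. This is a rule-based illustration, not how the maxent Tokenizer
works; the class AbbrevTokenizerSketch and its tokenize method are
hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only; not part of the OpenNLP Tokenizer API.
class AbbrevTokenizerSketch {

    // Splits on whitespace, then detaches a trailing dot unless the
    // whole chunk is listed in the abbreviation dictionary.
    static List<String> tokenize(String text, Set<String> abbreviations) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // happens only for empty input
            }
            boolean endsWithDot = chunk.length() > 1 && chunk.endsWith(".");
            if (endsWithDot && !abbreviations.contains(chunk)) {
                tokens.add(chunk.substring(0, chunk.length() - 1));
                tokens.add(".");
            } else {
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> abbreviations = new HashSet<>(Arrays.asList("Mr."));
        System.out.println(tokenize("Mr. Smith went home.", abbreviations));
    }
}
```

In the statistical Tokenizer the dictionary lookup would more likely become a
feature than a hard rule, so the model could still split in ambiguous cases.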

> Let's get the dictionary support into a good state again.
>

I'll start working on that soon.
Thanks

Re: Abbreviation in SentenceDetector

Posted by Jörn Kottmann <ko...@gmail.com>.

On 6/15/11 9:07 PM, william.colen@gmail.com wrote:
> 1) How is the sentence detector using the abbreviation dictionary? All train
> methods in SentenceDetectorME take an abbreviation dictionary as an argument,
> but only save it to the model. The dictionary is not used to create
> the context generator, but it should be, shouldn't it?

I am not sure how the dictionary is used, or what the intent was.
Do we have features in the sentence detectors which are based on
a dictionary?

Let's get the dictionary support into a good state again.

Jörn

Re: Abbreviation in SentenceDetector

Posted by Jason Baldridge <ja...@gmail.com>.

On Wed, Jun 15, 2011 at 2:07 PM, william.colen@gmail.com <
william.colen@gmail.com> wrote:

> Hi,
>
> I have a few questions about abbreviations in the sentence detector. I'd like
> to understand how it works and improve it if possible.
>
> 1) How is the sentence detector using the abbreviation dictionary? All train
> methods in SentenceDetectorME take an abbreviation dictionary as an argument,
> but only save it to the model. The dictionary is not used to create
> the context generator, but it should be, shouldn't it?
>
>
I thought it did, though I haven't looked at that bit of code for a while.


> 2) The command line trainer does not allow passing an abbreviation
> dictionary. Maybe it should accept the name of a file that contains the
> dictionary.
>
>
+1


> 3) Maybe we should include tools to extract the abbreviation dictionary
> from the training corpus. Optionally this could be executed during training
> too.
>
>
Doing that extraction actually requires a bit of work to figure out what is
an abbreviation.  Something of interest here is PUNKT, an unsupervised
method for detecting sentences/abbreviations:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017&rep=rep1&type=pdf

Implementation in NLTK:
http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt-module.html
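
For comparison, a much cruder supervised heuristic is sketched below. It is
not Punkt. It assumes the training corpus has one sentence per line, so a
token that ends in a dot but is not the last token on its line cannot be a
sentence end and is counted as an abbreviation candidate. The class
AbbrevExtractorSketch, the method extract, and the minCount threshold are
hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch only; a naive heuristic, not an implementation of Punkt.
class AbbrevExtractorSketch {

    // Collects tokens ending in '.' that occur before the end of a sentence
    // at least minCount times. Input: one sentence per list element.
    static Set<String> extract(List<String> sentences, int minCount) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            String[] tokens = sentence.trim().split("\\s+");
            // Skip the last token: its dot most likely ends the sentence.
            for (int i = 0; i < tokens.length - 1; i++) {
                String t = tokens[i];
                if (t.length() > 1 && t.endsWith(".") && Character.isLetter(t.charAt(0))) {
                    counts.merge(t, 1, Integer::sum);
                }
            }
        }
        Set<String> candidates = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                candidates.add(e.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "Dr. Smith met Mr. Jones.",
                "Mr. Jones thanked Dr. Smith.",
                "They left at noon.");
        System.out.println(extract(corpus, 2));
    }
}
```

This misses abbreviations that only occur at the end of a sentence and
depends entirely on the quality of the sentence annotation, which is part of
why an unsupervised approach is of interest.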

-Jason

-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge