Posted to users@opennlp.apache.org by "Richard Head Jr." <hs...@yahoo.com> on 2013/04/15 02:31:06 UTC

Is Using NER The Right Approach?

I have a bunch of sentences like the following: 

Guacamole Dip: 5 Hass Avocados, Jalapeno Puree with Salt and BHT (preservative).

They are standalone, i.e., they are not contained within a larger paragraph/document structure.

I want to tag various words, creating the following: 

Guacamole Dip: 5 Hass <START:term>Avocados<END>, <START:term>Jalapeno<END> Puree with <START:term>Salt<END> and <START:term>BHT<END> (preservative).

Looking through the mailing list for guidance, I came across this: 

http://mail-archives.apache.org/mod_mbox/opennlp-users/201205.mbox/%3C4FA1EE7E.2080608%40gmail.com%3E

Which made me think that, before going through 100 or so documents and tagging the words to create training data, I should get some clarification on the following:

1. Is NER the right tool for this?
2. My training data is somewhat small (~100 sentences); will this stymie my goal above?
3. Were the poor results the gentleman had with Italian addresses in part due to a bug mentioned here:
http://mail-archives.apache.org/mod_mbox/opennlp-users/201205.mbox/%3C4FA1EF10.2020904%40gmail.com%3E
4. Is it possible to use a text file containing only terms, or a tab-delimited file like the ones the Stanford NER uses?

Thanks in advance.


Re: Is Using NER The Right Approach?

Posted by James Kosin <ja...@gmail.com>.
On 4/17/2013 12:15 AM, Richard Head Jr. wrote:
>
> --- On Mon, 4/15/13, Jörn Kottmann <ko...@gmail.com> wrote:
>> Yes, the NER should be capable of detecting the terms, but
>> you could also try to use a dictionary.
> Are you referring to a POS dictionary? I would have just 2 parts of speech: the terms and the other words, correct? What's the advantage of using NER over POS?
No, we are talking about the DictionaryNameFinder component.
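
For example, here is a minimal sketch (the two-argument
DictionaryNameFinder constructor that takes a type name may depend on
your OpenNLP version):

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.namefind.DictionaryNameFinder;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;
import opennlp.tools.util.StringList;

public class DictionaryFinderSketch {
    public static void main(String[] args) {
        // Build an in-memory dictionary of the terms to look up.
        Dictionary terms = new Dictionary();
        terms.put(new StringList("Avocados"));
        terms.put(new StringList("Jalapeno"));
        terms.put(new StringList("Salt"));
        terms.put(new StringList("BHT"));

        // Pure lookup: no statistical model, no labeled sentences.
        DictionaryNameFinder finder = new DictionaryNameFinder(terms, "term");

        // SimpleTokenizer splits off punctuation, so "Avocados," matches.
        String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
            "Guacamole Dip: 5 Hass Avocados, Jalapeno Puree with Salt and BHT");
        for (Span span : finder.find(tokens)) {
            System.out.println(span.getType() + ": " + tokens[span.getStart()]);
        }
    }
}

The obvious limitation is that it only finds terms already in the list;
the statistical name finder can generalize to unseen terms.
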
>
>> Your training data is too small, especially when you train
>> with a cutoff of 5 and the maxent model;
>> the perceptron will work better.
> So the perceptron is good for a small set of training data? Is maxent even necessary when words are not composed of other words?
>
>> Label more data until you have a few thousand sentences.
> Yes, this is my problem. I don't have thousands of sentences and I'm afraid to take the time and label the 100 or so that I have only for it to fail.
>
> Is there a (dis)advantage to training with 1000 long sentences over, say, 2500 short ones?
>
> Thanks!
Train with sentences from your domain.  If all the sentences you are 
parsing are short, then train on short ones.  The real difficulty is 
getting a good sample space to train with: if you don't have many 
sentences, the model only learns from those few, and it won't 
generalize well.
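
One cheap way to find out whether the ~100 sentences are enough before
labeling thousands more is to cross-validate on what you have.  A
minimal sketch, assuming a recent OpenNLP release (train.txt is a
hypothetical file with one <START:term>-annotated sentence per line):

import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderCrossValidator;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class CrossValidateSketch {
    public static void main(String[] args) throws Exception {
        // Read the annotated sentences, one per line.
        ObjectStream<String> lines = new PlainTextByLineStream(
            new MarkableFileInputStreamFactory(new File("train.txt")),
            StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        TokenNameFinderCrossValidator cv = new TokenNameFinderCrossValidator(
            "en", "term", TrainingParameters.defaultParams(),
            new TokenNameFinderFactory());

        cv.evaluate(samples, 10); // 10-fold cross-validation
        System.out.println(cv.getFMeasure());
    }
}

A poor F-measure over the folds tells you up front, before investing in
more annotation, whether more labeled data (or the dictionary route) is
needed.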


Re: Is Using NER The Right Approach?

Posted by "Richard Head Jr." <hs...@yahoo.com>.

--- On Mon, 4/15/13, Jörn Kottmann <ko...@gmail.com> wrote:
> Yes, the NER should be capable of detecting the terms, but
> you could also try to use a dictionary.

Are you referring to a POS dictionary? I would have just 2 parts of speech: the terms and the other words, correct? What's the advantage of using NER over POS?

> Your training data is too small, especially when you train
> with a cutoff of 5 and the maxent model;
> the perceptron will work better.

So the perceptron is good for a small set of training data? Is maxent even necessary when words are not composed of other words?

> Label more data until you have a few thousand sentences.

Yes, this is my problem. I don't have thousands of sentences and I'm afraid to take the time and label the 100 or so that I have only for it to fail. 

Is there a (dis)advantage to training with 1000 long sentences over, say, 2500 short ones?

Thanks!



Re: Is Using NER The Right Approach?

Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/15/2013 02:31 AM, Richard Head Jr. wrote:
> I have a bunch of sentences like the following:
>
> Guacamole Dip: 5 Hass Avocados, Jalapeno Puree with Salt and BHT (preservative).
>
> They are standalone, i.e., they are not contained within a larger paragraph/document structure.
>
> I want to tag various words, creating the following:
>
> Guacamole Dip: 5 Hass <START:term>Avocados<END>, <START:term>Jalapeno<END> Puree with <START:term>Salt<END> and <START:term>BHT<END> (preservative).
>
> Looking through the mailing list for guidance, I came across this:
>
> http://mail-archives.apache.org/mod_mbox/opennlp-users/201205.mbox/%3C4FA1EE7E.2080608%40gmail.com%3E
>
> Which made me think that, before going through 100 or so documents and tagging the words to create training data, I should get some clarification on the following:
>
> 1. Is NER the right tool for this?
> 2. My training data is somewhat small (~100 sentences); will this stymie my goal above?
> 3. Were the poor results the gentleman had with Italian addresses in part due to a bug mentioned here:
> http://mail-archives.apache.org/mod_mbox/opennlp-users/201205.mbox/%3C4FA1EF10.2020904%40gmail.com%3E
> 4. Is it possible to use a text file containing only terms, or a tab-delimited file like the ones the Stanford NER uses?
>

Yes, the NER should be capable of detecting the terms, but you could 
also try to use a dictionary.
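
For the statistical route, running a trained model is only a few lines.
A rough sketch (en-term.bin is a hypothetical model file name):

import java.io.File;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class FindTermsSketch {
    public static void main(String[] args) throws Exception {
        TokenNameFinderModel model =
            new TokenNameFinderModel(new File("en-term.bin"));
        NameFinderME finder = new NameFinderME(model);

        String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
            "Guacamole Dip: 5 Hass Avocados, Jalapeno Puree with Salt "
            + "and BHT (preservative).");
        for (Span span : finder.find(tokens)) {
            System.out.println(span); // token offsets plus the type, e.g. term
        }
        finder.clearAdaptiveData(); // reset context between documents
    }
}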

Your training data is too small, especially when you train with a cutoff 
of 5 and the maxent model; the perceptron will work better. Label more 
data until you have a few thousand sentences.
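
Concretely, the trainer and cutoff are just TrainingParameters at
training time.  A minimal sketch of training with the perceptron and a
low cutoff (method signature from recent OpenNLP releases; file names
are hypothetical):

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTermFinderSketch {
    public static void main(String[] args) throws Exception {
        ObjectStream<String> lines = new PlainTextByLineStream(
            new MarkableFileInputStreamFactory(new File("train.txt")),
            StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // Perceptron with a low cutoff instead of maxent's default cutoff of 5.
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
        params.put(TrainingParameters.CUTOFF_PARAM, "0");
        params.put(TrainingParameters.ITERATIONS_PARAM, "300");

        TokenNameFinderModel model = NameFinderME.train(
            "en", "term", samples, params, new TokenNameFinderFactory());

        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("en-term.bin"))) {
            model.serialize(out);
        }
    }
}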

The mentioned bug was fixed in 1.5.3, but it only occurred in multi-type 
models.
You need complete sentences to train the NER model; just using the terms 
does not work, and no, we do not support the Stanford format.

Jörn