Posted to users@opennlp.apache.org by Yakov Keranchuk <ya...@gmail.com> on 2013/10/02 13:28:00 UTC

Re: category tagging

Hello Svetoslav,

My primary goal is to categorize nouns and verbs in a specific context
for further automatic analysis (in our example, to find all animals and
actions). After some investigation of OpenNLP, I concluded that the POS
tagger is the most suitable tool for this purpose. I would appreciate any
advice in this area.

I would also like to clarify the usage of the training tool:

opennlp POSTaggerTrainer -type maxent -model en-pos-maxent.bin \
                           -lang en -data en-pos.train -encoding UTF-8


1. Is the .train file the source for learning? What format should it use?
2. Is en-pos-maxent.bin (in this case) the output model file?

Sorry for newbie questions :)
Best regards,
Yakov


On Thu, Aug 29, 2013 at 5:08 PM, Svetoslav Marinov <
svetoslav.marinov@findwise.com> wrote:

> Hi Yakov,
>
> Yes, you can use the POS tagger to tag with whatever categories you choose.
>
> If we take your original example "The quick brown fox_animal jumps_action
> over the lazy dog_animal",
> make sure you tag all tokens, e.g. "The_NA quick_NA brown_NA fox_animal
> jumps_action over_NA the_NA lazy_NA dog_animal", where NA just means
> not-applicable. You choose your categories. Then you can extract only
> those words and categories you are interested in.
>
> Then you'll need to tag some data to train on, about 15 000 examples
> or more. You can also supply a POS tag dictionary, which helps
> reduce the search space of possible tags for a token.
>
> You can have the same tags across languages but each language should have
> its own training data and dictionary.
>
> However, I am not sure how successful this approach will be when
> you only need partial annotation.
>
> What do you want to use it for? Maybe there are better options...
>
> Svetoslav
> ________________________________________
> From: Yakov Keranchuk <ya...@gmail.com>
> Sent: 29 August 2013 12:44
> To: users@opennlp.apache.org
> Subject: Re: category tagging
>
> So I found a simple example in the sources:
>
> WordTagSampleStreamTest.java parses the string "This_x1 is_x2 a_x3 test_x4
> sentence_x5 ._x6" using POSSample.
>
> As I understand it, the normal approach has a few steps for each
> language:
> 1. collect data for the model
> 2. create a POS dictionary like this:
> <dictionary>
> <entry tags="x1">
> <token>This</token>
> </entry>
> <entry tags="x2">
> <token>is</token>
> </entry>
> <entry tags="x3">
> <token>a</token>
> </entry>
> ...
>
> 3. train the model with this dictionary
>
> Is this the right approach? Is the POS tagger appropriate for this task?
>
> Thanks in advance,
> Yakov
>
> On Tue, Aug 27, 2013 at 6:31 PM, Yakov Keranchuk
> <ya...@gmail.com> wrote:
>
> > Hi
> >
> > Is it possible to tag tokens according to our own rules?
> > Example: The quick brown fox_animal jumps_action over the lazy
> > dog_animal
> >
> > Do we need to create a custom dictionary for the POS tagger?
> > If so, can a single dictionary be shared across several languages?
> >
> > Best regards,
> > Yakov
> >
>

Re: category tagging

Posted by Svetoslav Marinov <sv...@findwise.com>.

Hi Yakov,

1. Yes, the .train file is the source for learning. The format is as described in the documentation: Sentences need to be tokenized and each token has a part of speech tag assigned to it, e.g. Lions_NOUN eat_VERB zebras_NOUN ._PUNCT

2. Yes, the en-pos-maxent.bin is the output model file. But you can call it anything you want. You can also call the training file anything you like.
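
To make the format in point 1 concrete, here is a minimal sketch in plain Python (not OpenNLP code) that reads one line of word_tag data and checks that every token carries a tag. The function names are hypothetical, and splitting each item on its last underscore is an assumption of this sketch.

```python
def parse_tagged_sentence(line, sep="_"):
    """Split a 'word_tag word_tag ...' line into (token, tag) pairs.

    Assumption of this sketch: the tag follows the LAST separator in
    each whitespace-delimited item.
    """
    pairs = []
    for item in line.split():
        token, found, tag = item.rpartition(sep)
        if not found or not token or not tag:
            # Every token must be tagged, so reject partial annotation.
            raise ValueError(f"token without a tag: {item!r}")
        pairs.append((token, tag))
    return pairs


def tokens_with_tags(pairs, wanted):
    """Keep only the (token, tag) pairs whose tag is in `wanted`."""
    return [(token, tag) for token, tag in pairs if tag in wanted]


pairs = parse_tagged_sentence("Lions_NOUN eat_VERB zebras_NOUN ._PUNCT")
print(pairs)
# [('Lions', 'NOUN'), ('eat', 'VERB'), ('zebras', 'NOUN'), ('.', 'PUNCT')]
print(tokens_with_tags(pairs, {"NOUN"}))
# [('Lions', 'NOUN'), ('zebras', 'NOUN')]
```

The same filter would work with custom categories such as animal and action, which is how you could extract only the words you are interested in.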

If your text is in English you can start by using the pre-trained models available here:
http://opennlp.sourceforge.net/models-1.5/

Otherwise, if it's Russian you will need a corpus to train a POS tagger.
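
If you end up building the corpus by hand, partially annotated sentences like the fox example earlier in the thread still need a tag on every token. A minimal sketch in plain Python, assuming NA as the filler tag and a hypothetical helper name:

```python
def fill_default_tag(sentence, default="NA", sep="_"):
    """Give every untagged token the default tag.

    An item counts as tagged when it has a non-empty token and a
    non-empty tag around its last separator (assumption of this sketch).
    """
    tagged = []
    for item in sentence.split():
        token, found, tag = item.rpartition(sep)
        if found and token and tag:
            tagged.append(item)  # already annotated, keep as is
        else:
            tagged.append(f"{item}{sep}{default}")
    return " ".join(tagged)


partial = "The quick brown fox_animal jumps_action over the lazy dog_animal"
print(fill_default_tag(partial))
# The_NA quick_NA brown_NA fox_animal jumps_action over_NA the_NA lazy_NA dog_animal
```

Each resulting line would then go into the file passed to -data, as far as I know one sentence per line.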

Best,
Svetoslav
