Posted to users@opennlp.apache.org by Giorgio Valoti <gi...@me.com> on 2013/09/17 08:55:32 UTC

Getting started with OpenNLP and POS tagger

Hi all,
this is my first post to the list. I’ve tried to gather some info from the documentation and googling around but I haven’t found a satisfying answer to the following questions. Please tell me where to RTFM if some of these questions belong to some FAQ or are off-topic.

It seems there’s no way to incrementally train the POS tagger nor to parallelize this task. Is this correct? 

If the only way to train the POS tagger is in one single shot, how can I estimate the memory requirements for the JVM? In other words, given, say, a 1GB training corpus, is there a way to estimate how much RAM would be needed?

Finally, I have tried to use the `-ngram` switch:
> opennlp POSTaggerTrainer.conllx -type maxent -ngram 3 ... <other options as usual: -lang -model -data -encoding>

but I get this error:
> Building ngram dictionary ... IO error while building NGram Dictionary: Stream not marked
> Stream not marked
> java.io.IOException: Stream not marked
>         at java.io.BufferedReader.reset(BufferedReader.java:485)
>         at opennlp.tools.util.PlainTextByLineStream.reset(PlainTextByLineStream.java:79)
>         at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
>         at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
>         at opennlp.tools.cmdline.postag.POSTaggerTrainerTool.run(POSTaggerTrainerTool.java:80)
>         at opennlp.tools.cmdline.CLI.main(CLI.java:222)


But I can’t find out what I’m doing wrong.


Any help really appreciated.

--
Giorgio Valoti


Re: Getting started with OpenNLP and POS tagger

Posted by Giorgio Valoti <gi...@me.com>.
On 17 Sep 2013, at 10:18, Jörn Kottmann wrote:

> On 09/17/2013 09:53 AM, Giorgio Valoti wrote:
>> <http://www.corpusitaliano.it/en/index.html>  The whole corpus is well over 9GB. It’s not my plan to analyze the whole thing, of course! Do you think it would be realistic to use the evaluation tool to decide on a reasonable size for the corpus? I’m not an expert, but I guess there’s no point in analyzing that much data if you can achieve good enough accuracy with a much smaller sample, right?
> 
> The model performance depends on the quality of your training data. The description says that the corpus is in part manually corrected for annotations. I would suggest training only on these parts if possible, because the other parts are probably less accurate.

Unfortunately, it seems there’s no way to tell which parts are manually corrected. :( I’ve contacted the site; we’ll see.

> 
> Depending on the performance of the model on your data, you could annotate some of your documents and add them to the training data; this usually helps a lot.


--
Giorgio Valoti


Re: Getting started with OpenNLP and POS tagger

Posted by Jörn Kottmann <ko...@gmail.com>.
On 09/17/2013 09:53 AM, Giorgio Valoti wrote:
> <http://www.corpusitaliano.it/en/index.html>  The whole corpus is well over 9GB. It’s not my plan to analyze the whole thing, of course! Do you think it would be realistic to use the evaluation tool to decide on a reasonable size for the corpus? I’m not an expert, but I guess there’s no point in analyzing that much data if you can achieve good enough accuracy with a much smaller sample, right?

The model performance depends on the quality of your training data. The 
description says that the corpus is in part manually corrected for 
annotations. I would suggest training only on these parts if possible, 
because the other parts are probably less accurate.

Depending on the performance of the model on your data, you could 
annotate some of your documents and add them to the training data; this 
usually helps a lot.

Jörn

Re: Getting started with OpenNLP and POS tagger

Posted by Giorgio Valoti <gi...@me.com>.
On 17 Sep 2013, at 09:19, Jörn Kottmann wrote:

> On 09/17/2013 08:55 AM, Giorgio Valoti wrote:
>> Hi all,
>> this is my first post to the list. I’ve tried to gather some info from the documentation and googling around but I haven’t found a satisfying answer to the following questions. Please tell me where to RTFM if some of these questions belong to some FAQ or are off-topic.
>> 
>> It seems there’s no way to incrementally train the POS tagger nor to parallelize this task. Is this correct?
>> 
>> If the only way to train the POS tagger is in one single shot, how can I estimate the memory requirements for the JVM? In other words, given, say, a 1GB training corpus, is there a way to estimate how much RAM would be needed?
>> 
>> Finally, I have tried to use the `-ngram` switch:
>>> opennlp POSTaggerTrainer.conllx -type maxent -ngram 3 ... <other options as usual: -lang -model -data -encoding>
>> but I get this error:
>>> Building ngram dictionary ... IO error while building NGram Dictionary: Stream not marked
>>> Stream not marked
>>> java.io.IOException: Stream not marked
>>>         at java.io.BufferedReader.reset(BufferedReader.java:485)
>>>         at opennlp.tools.util.PlainTextByLineStream.reset(PlainTextByLineStream.java:79)
>>>         at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
>>>         at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
>>>         at opennlp.tools.cmdline.postag.POSTaggerTrainerTool.run(POSTaggerTrainerTool.java:80)
>>>         at opennlp.tools.cmdline.CLI.main(CLI.java:222)
>> 
>> But I can’t find out what I’m doing wrong.
>> 
> 
> Looks like it tries to reset the stream, but that doesn't seem to work. Are you using 1.5.3?
> Please open a JIRA issue for this so we can fix it.

Yes, it’s 1.5.3. I’ll open it ASAP.

> 
> Usually the POS tagger is trained without this ngram option; it is an old left-over experiment
> in the code that didn't turn out to improve things.

Ah ok, that’s good to know. In fact, I got pretty good results without it, but I was wondering if `-ngram` could deliver even more precise results.

> 
> If you have that much data you probably want to use the TwoPassDataIndexer; I am not sure if the POS tagger does that by default.
> A higher cutoff might help to reduce the required memory; otherwise just try to give the process a couple of gigabytes of RAM.
> 
> On which data set are you training? A gigabyte of training data is quite a lot ...

<http://www.corpusitaliano.it/en/index.html> The whole corpus is well over 9GB. It’s not my plan to analyze the whole thing, of course! Do you think it would be realistic to use the evaluation tool to decide on a reasonable size for the corpus? I’m not an expert, but I guess there’s no point in analyzing that much data if you can achieve good enough accuracy with a much smaller sample, right?
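(For reference, a sketch of how the evaluation-tool run being discussed might look, assuming 1.5.3's POSTaggerEvaluator accepts the same `.conllx` format suffix and options as the trainer; the file names here are illustrative:)

```shell
# Evaluate a trained model against a held-out CoNLL-X sample;
# the tool reports tagging accuracy on the given data.
opennlp POSTaggerEvaluator.conllx -model it-pos-maxent.bin \
    -data heldout.conllx -encoding UTF-8
```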


Ciao

--
Giorgio Valoti


Re: Getting started with OpenNLP and POS tagger

Posted by Jörn Kottmann <ko...@gmail.com>.
On 09/17/2013 08:55 AM, Giorgio Valoti wrote:
> Hi all,
> this is my first post to the list. I’ve tried to gather some info from the documentation and googling around but I haven’t found a satisfying answer to the following questions. Please tell me where to RTFM if some of these questions belong to some FAQ or are off-topic.
>
> It seems there’s no way to incrementally train the POS tagger nor to parallelize this task. Is this correct?
>
> If the only way to train the POS tagger is in one single shot, how can I estimate the memory requirements for the JVM? In other words, given, say, a 1GB training corpus, is there a way to estimate how much RAM would be needed?
>
> Finally, I have tried to use the `-ngram` switch:
>> opennlp POSTaggerTrainer.conllx -type maxent -ngram 3 ... <other options as usual: -lang -model -data -encoding>
> but I get this error:
>> Building ngram dictionary ... IO error while building NGram Dictionary: Stream not marked
>> Stream not marked
>> java.io.IOException: Stream not marked
>>          at java.io.BufferedReader.reset(BufferedReader.java:485)
>>          at opennlp.tools.util.PlainTextByLineStream.reset(PlainTextByLineStream.java:79)
>>          at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
>>          at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
>>          at opennlp.tools.cmdline.postag.POSTaggerTrainerTool.run(POSTaggerTrainerTool.java:80)
>>          at opennlp.tools.cmdline.CLI.main(CLI.java:222)
>
> But I can’t find out what I’m doing wrong.
>

Looks like it tries to reset the stream, but that doesn't seem to work. 
Are you using 1.5.3?
Please open a JIRA issue for this so we can fix it.

Usually the POS tagger is trained without this ngram option; it is an 
old left-over experiment in the code that didn't turn
out to improve things.
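(Concretely, dropping `-ngram` leaves the plain training invocation; a sketch assuming the same option set as in the original message, with illustrative file names and `-lang it` for the Italian corpus:)

```shell
# Train a maxent POS model from CoNLL-X data, without the ngram dictionary
opennlp POSTaggerTrainer.conllx -type maxent -lang it -encoding UTF-8 \
    -data train.conllx -model it-pos-maxent.bin
```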

If you have that much data you probably want to use the 
TwoPassDataIndexer; I am not sure if the POS tagger does that by default.
A higher cutoff might help to reduce the required memory; otherwise 
just try to give the process a couple of gigabytes of RAM.
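(The cutoff and other training parameters can be supplied through a parameters file passed with `-params`; a minimal sketch, assuming the 1.5.3 key=value format, with illustrative values:)

```properties
Algorithm=MAXENT
Iterations=100
Cutoff=10
```

A higher Cutoff discards rare features, which shrinks the model and the memory needed while indexing the training data.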

On which data set are you training? A gigabyte of training data is quite 
a lot ...

HTH,
Jörn