Posted to users@opennlp.apache.org by Damiano Porta <da...@gmail.com> on 2017/01/03 18:22:57 UTC

Speed up training

Hello,
I have a very large training set. Is there a way to speed up the
training process? So far I have only changed the Xmx option inside bin/opennlp.

Thanks
Damiano

Re: Speed up training

Posted by David Sanderson <da...@crowdcare.com>.
The "GC overhead limit exceeded" error indicates there is not enough memory
available on the machine, or that the JVM is not managing the allocated
memory effectively.

In my experience, increasing the number of threads speeds up the training of
a model; however, raising this parameter can trigger this error once the
training corpus reaches a certain critical size. Decreasing the number of
threads at that point has allowed me to work around the issue.

There appears to be a trade-off at some point between faster training and
effective memory management for the training process on a given machine.

Isolating the precise causes of this behaviour and determining how to best
control it would be useful research.

For a discussion of Java memory management see this article:
http://stackoverflow.com/questions/25033458/memory-consumed-by-a-thread
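
As a concrete illustration of the thread-count advice above, here is a minimal
sketch (not part of the original message; the parameter name follows the
OpenNLP 1.7.x API and the values are examples only):

import opennlp.tools.util.TrainingParameters;

public class ThreadCountTuning {

    // Back the Maxent trainer's thread count down (e.g. from 8 to 2) when a
    // large corpus triggers "GC overhead limit exceeded", trading speed for
    // lower memory pressure; pass the result to the trainer.
    public static TrainingParameters reducedThreadParams() {
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.THREADS_PARAM, "2");
        return params;
    }
}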


D.S.


-- 
David Sanderson
Natural Language Processing Developer
CrowdCare Corporation
wysdom.com

Re: Speed up training

Posted by Damiano Porta <da...@gmail.com>.
I am training the model in this way:

opennlp POSTaggerTrainer -type maxent \
  -model /home/damiano/it-pos-maxent-new.bin \
  -lang it \
  -data /home/damiano/postagger.train \
  -encoding UTF-8


Re: Speed up training

Posted by Damiano Porta <da...@gmail.com>.
I am using the default postagger tool.

I have many sentences like:

word1_pos word2_pos ...

I did not add anything. I know it is a Java problem, but how is it possible
that 7GB cannot handle a training corpus of 1GB?




Re: Speed up training

Posted by William Colen <wi...@gmail.com>.
Review your context generator. Maybe it is generating too many features. Try
to keep the feature strings in the context generator small.
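
For illustration, a minimal sketch (not from the original message) of a context
generator that deliberately keeps its feature strings small. It assumes the
OpenNLP 1.7.x POSContextGenerator interface; wiring it into training would
additionally require a custom POSTaggerFactory, which is not shown.

import opennlp.tools.postag.POSContextGenerator;

public class SmallPosContextGenerator implements POSContextGenerator {

    private static final int MAX_FEATURE_LENGTH = 20;

    // Emit only a few short features: truncated word form, previous tag, suffix.
    // Assumption: a POSTaggerFactory subclass overriding getPOSContextGenerator()
    // would be needed to use this during training.
    @Override
    public String[] getContext(int index, String[] tokens, String[] prevTags,
                               Object[] additionalContext) {
        String word = truncate(tokens[index].toLowerCase());
        String prevTag = index > 0 ? prevTags[index - 1] : "BOS";
        String suffix = word.length() > 3 ? word.substring(word.length() - 3) : word;
        return new String[] {"w=" + word, "pt=" + prevTag, "suf=" + suffix};
    }

    private static String truncate(String s) {
        return s.length() > MAX_FEATURE_LENGTH ? s.substring(0, MAX_FEATURE_LENGTH) : s;
    }
}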



Re: Speed up training

Posted by Damiano Porta <da...@gmail.com>.
I always get: Exception in thread "main" java.lang.OutOfMemoryError: GC
overhead limit exceeded
I am using 5GB of Xmx for 1GB of training data... I will try 7GB for training.

Could the number of threads help?


Re: Speed up training

Posted by Damiano Porta <da...@gmail.com>.
OK, I think the best value is to match the number of CPU cores, right?


Re: Speed up training

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
I do not believe the perceptron trainer is multithreaded.  But it should be fast.
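
For reference, a minimal sketch (not part of the original message) of selecting
the perceptron trainer through TrainingParameters; the constant names come from
the OpenNLP 1.7.x API, and the iteration/cutoff values are only examples.

import opennlp.tools.util.TrainingParameters;

public class PerceptronParams {

    // The perceptron trainer runs single-threaded but is typically fast; the
    // returned parameters would be passed to e.g. POSTaggerME.train(...).
    public static TrainingParameters perceptronParams() {
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
        params.put(TrainingParameters.ITERATIONS_PARAM, "100"); // example value
        params.put(TrainingParameters.CUTOFF_PARAM, "5");       // example value
        return params;
    }
}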



Re: Speed up training

Posted by William Colen <co...@apache.org>.
No. Unfortunately, it is currently available only for Maxent.



Re: Speed up training

Posted by Damiano Porta <da...@gmail.com>.
Hi William, thank you!
Is there a similar option for the perceptron (perceptron sequence) trainer too?


Re: Speed up training

Posted by William Colen <co...@apache.org>.
Damiano,

If you are using Maxent, try TrainingParameters.THREADS_PARAM

https://opennlp.apache.org/documentation/1.7.0/apidocs/opennlp-tools/opennlp/tools/util/TrainingParameters.html#THREADS_PARAM
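
For illustration, a minimal sketch (not from the original message) of setting
THREADS_PARAM through the Java API instead of the CLI. Class names follow the
OpenNLP 1.7.x API; the language code and training file come from this thread,
and the thread count of 4 is only an example.

import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerFactory;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class MultiThreadedPosTraining {

    public static void main(String[] args) throws Exception {
        // One sentence per line in word1_pos word2_pos ... format, as in this thread.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("/home/damiano/postagger.train")),
                StandardCharsets.UTF_8);
        ObjectStream<POSSample> samples = new WordTagSampleStream(lines);

        // defaultParams() selects the Maxent (GIS) trainer; THREADS_PARAM is the
        // multi-threading switch referred to above.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.THREADS_PARAM, "4");

        POSModel model = POSTaggerME.train("it", samples, params, new POSTaggerFactory());
        // Serialize with model.serialize(...) to write the .bin file, as needed.
    }
}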

William


Re: Speed up training

Posted by Damiano Porta <da...@gmail.com>.
I am training a new postagger and lemmatizer.


Re: Speed up training

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
Can you be a little more specific?  What trainer are you using?
Thanks
Daniel
