Posted to dev@opennlp.apache.org by Svetoslav Marinov <sv...@findwise.com> on 2013/04/26 09:06:50 UTC

Size of training data

Hi all,

I'm wondering what the maximum size (if such a limit exists) is for training a NER model. I have a corpus of 2 600 000 sentences annotated with just one category, 310M in size. However, the training never finishes: with 8G of memory it ended in a Java out-of-memory exception, and when I increased it to 16G it just died with no error message.

Any help about this would be highly appreciated.

Svetoslav


Re: Size of training data

Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/29/2013 02:32 PM, Svetoslav Marinov wrote:
> Ok, I hope I do this correctly: The counter for sample object I take from
> sampleStream: ObjectStream<NameSample> sampleStream = new
> NameSampleDataStream(lineStream);
>
> I use sampleStream.read() and get 468 fewer samples than the number of
> sentences (which is 2 611 247). Shouldn't sampleStream match the number
> of sentences? I have samples without entities, but I suspect they are more
> than 468. Will check though.
>
> Otherwise I am not sure where to measure how many are processed per
> second. Do you mean during the creation of the NEmodel? Or? How does one
> do that?

You could implement a proxy ObjectStream that is inserted into the stream;
its read method can then do the counting and maybe print out
the progress every n calls.
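
A minimal sketch of such a counting proxy, as an illustration only. To keep the snippet self-contained, the ObjectStream interface is reduced here to its read() method; the real opennlp.tools.util.ObjectStream<T> also declares reset() and close(), and its read() throws IOException.

```java
// Simplified stand-in for opennlp.tools.util.ObjectStream<T>; the real
// interface also declares reset() and close(), and read() throws IOException.
interface ObjectStream<T> {
    T read();
}

// Proxy that counts read() calls and reports progress every n samples.
// Insert it between the NameSampleDataStream and the trainer.
class CountingObjectStream<T> implements ObjectStream<T> {
    private final ObjectStream<T> samples;
    private final int reportEvery;
    private long count;

    CountingObjectStream(ObjectStream<T> samples, int reportEvery) {
        this.samples = samples;
        this.reportEvery = reportEvery;
    }

    @Override
    public T read() {
        T sample = samples.read();
        if (sample != null && ++count % reportEvery == 0) {
            System.err.println(count + " samples read");
        }
        return sample;
    }

    long getCount() {
        return count;
    }
}
```

Wrapping the stream in such a proxy before handing it to the trainer shows both whether the count exceeds the number of sentences and, if the read calls are timed, roughly how many samples are processed per second.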

The difference could come from empty lines in your training data; only
non-empty lines become sample objects.

Jörn


Re: Size of training data

Posted by Svetoslav Marinov <sv...@findwise.com>.
Ok, I hope I do this correctly: The counter for sample object I take from
sampleStream: ObjectStream<NameSample> sampleStream = new
NameSampleDataStream(lineStream);

I use sampleStream.read() and get 468 fewer samples than the number of
sentences (which is 2 611 247). Shouldn't sampleStream match the number
of sentences? I have samples without entities, but I suspect they are more
than 468. Will check though.

Otherwise I am not sure where to measure how many are processed per
second. Do you mean during the creation of the NE model? How does one
do that?

Thank you!

Svetoslav

On 2013-04-29 11:14, "Jörn Kottmann" <ko...@gmail.com> wrote:

>It's a bit hard to diagnose the problem, but my best guess here is that
>for some reason the sample object stream is endless or the feature
>generation
>is very slow.
>
>Can you add a counter to the code which provides the sample objects? It
>should not
>exceed your number of sentences; if the stream is endless it might be
>bigger after an hour or two.
>
>Can you measure how many of them are processed per second (it should be
>more than 1k samples per second)?
>If the throughput is too low it might just need a lot of time.
>
>Jörn
>
>On 04/29/2013 10:55 AM, Svetoslav Marinov wrote:
>> Yes, the process is at 100% CPU utilization and this is the only thing I
>> get from the jstack, no matter how many times I repeat it:
>>
>> 2013-04-29 10:47:17
>> Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):
>>
>> "Attach Listener" daemon prio=10 tid=0x00007f31a8001000 nid=0xf42b
>>waiting
>> on condition [0x0000000000000000]
>>     java.lang.Thread.State: RUNNABLE
>>
>> "Low Memory Detector" daemon prio=10 tid=0x00007f31d009d800 nid=0xe272
>> runnable [0x0000000000000000]
>>     java.lang.Thread.State: RUNNABLE
>>
>> "C2 CompilerThread1" daemon prio=10 tid=0x00007f31d009b000 nid=0xe271
>> waiting on condition [0x0000000000000000]
>>     java.lang.Thread.State: RUNNABLE
>>
>> "C2 CompilerThread0" daemon prio=10 tid=0x00007f31d0098800 nid=0xe270
>> waiting on condition [0x0000000000000000]
>>     java.lang.Thread.State: RUNNABLE
>>
>> "Signal Dispatcher" daemon prio=10 tid=0x00007f31d008a000 nid=0xe26f
>> runnable [0x0000000000000000]
>>     java.lang.Thread.State: RUNNABLE
>>
>> "Finalizer" daemon prio=10 tid=0x00007f31d0078000 nid=0xe26e in
>> Object.wait() [0x00007f31ca3db000]
>>     java.lang.Thread.State: WAITING (on object monitor)
>> 	at java.lang.Object.wait(Native Method)
>> 	- waiting on <0x0000000400b8f660> (a java.lang.ref.ReferenceQueue$Lock)
>> 	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
>> 	- locked <0x0000000400b8f660> (a java.lang.ref.ReferenceQueue$Lock)
>> 	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
>> 	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)
>>
>> "Reference Handler" daemon prio=10 tid=0x00007f31d0076000 nid=0xe26d in
>> Object.wait() [0x00007f31ca4dc000]
>>     java.lang.Thread.State: WAITING (on object monitor)
>> 	at java.lang.Object.wait(Native Method)
>> 	- waiting on <0x0000000400b8f5f8> (a java.lang.ref.Reference$Lock)
>> 	at java.lang.Object.wait(Object.java:502)
>> 	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
>> 	- locked <0x0000000400b8f5f8> (a java.lang.ref.Reference$Lock)
>>
>> "main" prio=10 tid=0x00007f31d0007800 nid=0xe267 waiting on condition
>> [0x00007f31d8923000]
>>     java.lang.Thread.State: RUNNABLE
>> 	at java.util.Arrays.copyOfRange(Arrays.java:3221)
>> 	at java.lang.String.<init>(String.java:233)
>> 	at java.lang.StringBuilder.toString(StringBuilder.java:447)
>> 	at opennlp.tools.util.featuregen.TokenClassFeatureGenerator.createFeatures(TokenClassFeatureGenerator.java:46)
>> 	at opennlp.tools.util.featuregen.WindowFeatureGenerator.createFeatures(WindowFeatureGenerator.java:109)
>> 	at opennlp.tools.util.featuregen.AggregatedFeatureGenerator.createFeatures(AggregatedFeatureGenerator.java:79)
>> 	at opennlp.tools.util.featuregen.CachedFeatureGenerator.createFeatures(CachedFeatureGenerator.java:69)
>> 	at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:118)
>> 	at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:37)
>> 	at opennlp.tools.namefind.NameFinderEventStream.generateEvents(NameFinderEventStream.java:103)
>> 	at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:126)
>> 	at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:37)
>> 	at opennlp.tools.util.AbstractEventStream.hasNext(AbstractEventStream.java:71)
>> 	at opennlp.model.HashSumEventStream.hasNext(HashSumEventStream.java:47)
>> 	at opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:126)
>> 	at opennlp.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:81)
>> 	at opennlp.model.TrainUtil.train(TrainUtil.java:173)
>> 	at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
>> 	at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)
>>
>> "VM Thread" prio=10 tid=0x00007f31d0071000 nid=0xe26c runnable
>>
>> "GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f31d0012800
>>nid=0xe268
>> runnable
>>
>> "GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f31d0014800
>>nid=0xe269
>> runnable
>>
>> "GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f31d0016000
>>nid=0xe26a
>> runnable
>>
>> "GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f31d0018000
>>nid=0xe26b
>> runnable
>>
>> "VM Periodic Task Thread" prio=10 tid=0x00007f31d00a0000 nid=0xe273
>> waiting on condition
>>
>> JNI global references: 1139
>>
>>
>>
>> On 2013-04-29 10:26, "Jörn Kottmann" <ko...@gmail.com> wrote:
>>
>>> On 04/29/2013 09:59 AM, Svetoslav Marinov wrote:
>>>> Below is a jstack output. It is now the third day it is running and
>>>> seems
>>>> like the process has hung up somewhere. I still haven't changed the
>>>> indexer to be one pass, so it is still two pass.
>>>>
>>>> I just wonder how long I should wait?
>>> Looks like it's still fetching the events from the source; the methods
>>> we can see in the stack dump are calculating the hash sum of the
>>> events, but I doubt that this is broken.
>>>
>>> Is the process at 100% CPU utilization? Is it still in the hash sum
>>>code
>>> if you repeat the jstack command a few times?
>>>
>>> Jörn
>>>
>>
>
>



Re: Size of training data

Posted by Jörn Kottmann <ko...@gmail.com>.
It's a bit hard to diagnose the problem, but my best guess here is that
for some reason the sample object stream is endless or the feature
generation is very slow.

Can you add a counter to the code which provides the sample objects? It
should not exceed your number of sentences; if the stream is endless it
might be bigger after an hour or two.

Can you measure how many of them are processed per second (it should be
more than 1k samples per second)? If the throughput is too low it might
just need a lot of time.

Jörn

On 04/29/2013 10:55 AM, Svetoslav Marinov wrote:
> Yes, the process is at 100% CPU utilization and this is the only thing I
> get from the jstack, no matter how many times I repeat it:
>
> 2013-04-29 10:47:17
> Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):
>
> "Attach Listener" daemon prio=10 tid=0x00007f31a8001000 nid=0xf42b waiting
> on condition [0x0000000000000000]
>     java.lang.Thread.State: RUNNABLE
>
> "Low Memory Detector" daemon prio=10 tid=0x00007f31d009d800 nid=0xe272
> runnable [0x0000000000000000]
>     java.lang.Thread.State: RUNNABLE
>
> "C2 CompilerThread1" daemon prio=10 tid=0x00007f31d009b000 nid=0xe271
> waiting on condition [0x0000000000000000]
>     java.lang.Thread.State: RUNNABLE
>
> "C2 CompilerThread0" daemon prio=10 tid=0x00007f31d0098800 nid=0xe270
> waiting on condition [0x0000000000000000]
>     java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x00007f31d008a000 nid=0xe26f
> runnable [0x0000000000000000]
>     java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x00007f31d0078000 nid=0xe26e in
> Object.wait() [0x00007f31ca3db000]
>     java.lang.Thread.State: WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	- waiting on <0x0000000400b8f660> (a java.lang.ref.ReferenceQueue$Lock)
> 	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
> 	- locked <0x0000000400b8f660> (a java.lang.ref.ReferenceQueue$Lock)
> 	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
> 	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)
>
> "Reference Handler" daemon prio=10 tid=0x00007f31d0076000 nid=0xe26d in
> Object.wait() [0x00007f31ca4dc000]
>     java.lang.Thread.State: WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	- waiting on <0x0000000400b8f5f8> (a java.lang.ref.Reference$Lock)
> 	at java.lang.Object.wait(Object.java:502)
> 	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
> 	- locked <0x0000000400b8f5f8> (a java.lang.ref.Reference$Lock)
>
> "main" prio=10 tid=0x00007f31d0007800 nid=0xe267 waiting on condition
> [0x00007f31d8923000]
>     java.lang.Thread.State: RUNNABLE
> 	at java.util.Arrays.copyOfRange(Arrays.java:3221)
> 	at java.lang.String.<init>(String.java:233)
> 	at java.lang.StringBuilder.toString(StringBuilder.java:447)
> 	at opennlp.tools.util.featuregen.TokenClassFeatureGenerator.createFeatures(TokenClassFeatureGenerator.java:46)
> 	at opennlp.tools.util.featuregen.WindowFeatureGenerator.createFeatures(WindowFeatureGenerator.java:109)
> 	at opennlp.tools.util.featuregen.AggregatedFeatureGenerator.createFeatures(AggregatedFeatureGenerator.java:79)
> 	at opennlp.tools.util.featuregen.CachedFeatureGenerator.createFeatures(CachedFeatureGenerator.java:69)
> 	at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:118)
> 	at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:37)
> 	at opennlp.tools.namefind.NameFinderEventStream.generateEvents(NameFinderEventStream.java:103)
> 	at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:126)
> 	at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:37)
> 	at opennlp.tools.util.AbstractEventStream.hasNext(AbstractEventStream.java:71)
> 	at opennlp.model.HashSumEventStream.hasNext(HashSumEventStream.java:47)
> 	at opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:126)
> 	at opennlp.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:81)
> 	at opennlp.model.TrainUtil.train(TrainUtil.java:173)
> 	at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
> 	at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)
>
> "VM Thread" prio=10 tid=0x00007f31d0071000 nid=0xe26c runnable
>
> "GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f31d0012800 nid=0xe268
> runnable
>
> "GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f31d0014800 nid=0xe269
> runnable
>
> "GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f31d0016000 nid=0xe26a
> runnable
>
> "GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f31d0018000 nid=0xe26b
> runnable
>
> "VM Periodic Task Thread" prio=10 tid=0x00007f31d00a0000 nid=0xe273
> waiting on condition
>
> JNI global references: 1139
>
>
>
> On 2013-04-29 10:26, "Jörn Kottmann" <ko...@gmail.com> wrote:
>
>> On 04/29/2013 09:59 AM, Svetoslav Marinov wrote:
>>> Below is a jstack output. It is now the third day it is running and
>>> seems
>>> like the process has hung up somewhere. I still haven't changed the
>>> indexer to be one pass, so it is still two pass.
>>>
>>> I just wonder how long I should wait?
>> Looks like it's still fetching the events from the source; the methods
>> we can see in the stack dump are calculating the hash sum of the events,
>> but I doubt that this is broken.
>>
>> Is the process at 100% CPU utilization? Is it still in the hash sum code
>> if you repeat the jstack command a few times?
>>
>> Jörn
>>
>


Re: Size of training data

Posted by Svetoslav Marinov <sv...@findwise.com>.
Yes, the process is at 100% CPU utilization and this is the only thing I
get from the jstack, no matter how many times I repeat it:

2013-04-29 10:47:17
Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00007f31a8001000 nid=0xf42b waiting
on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Low Memory Detector" daemon prio=10 tid=0x00007f31d009d800 nid=0xe272
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f31d009b000 nid=0xe271
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f31d0098800 nid=0xe270
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f31d008a000 nid=0xe26f
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f31d0078000 nid=0xe26e in
Object.wait() [0x00007f31ca3db000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x0000000400b8f660> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
	- locked <0x0000000400b8f660> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)

"Reference Handler" daemon prio=10 tid=0x00007f31d0076000 nid=0xe26d in
Object.wait() [0x00007f31ca4dc000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x0000000400b8f5f8> (a java.lang.ref.Reference$Lock)
	at java.lang.Object.wait(Object.java:502)
	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
	- locked <0x0000000400b8f5f8> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00007f31d0007800 nid=0xe267 waiting on condition
[0x00007f31d8923000]
   java.lang.Thread.State: RUNNABLE
	at java.util.Arrays.copyOfRange(Arrays.java:3221)
	at java.lang.String.<init>(String.java:233)
	at java.lang.StringBuilder.toString(StringBuilder.java:447)
	at opennlp.tools.util.featuregen.TokenClassFeatureGenerator.createFeatures(TokenClassFeatureGenerator.java:46)
	at opennlp.tools.util.featuregen.WindowFeatureGenerator.createFeatures(WindowFeatureGenerator.java:109)
	at opennlp.tools.util.featuregen.AggregatedFeatureGenerator.createFeatures(AggregatedFeatureGenerator.java:79)
	at opennlp.tools.util.featuregen.CachedFeatureGenerator.createFeatures(CachedFeatureGenerator.java:69)
	at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:118)
	at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:37)
	at opennlp.tools.namefind.NameFinderEventStream.generateEvents(NameFinderEventStream.java:103)
	at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:126)
	at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:37)
	at opennlp.tools.util.AbstractEventStream.hasNext(AbstractEventStream.java:71)
	at opennlp.model.HashSumEventStream.hasNext(HashSumEventStream.java:47)
	at opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:126)
	at opennlp.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:81)
	at opennlp.model.TrainUtil.train(TrainUtil.java:173)
	at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
	at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)

"VM Thread" prio=10 tid=0x00007f31d0071000 nid=0xe26c runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f31d0012800 nid=0xe268
runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f31d0014800 nid=0xe269
runnable 

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f31d0016000 nid=0xe26a
runnable 

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f31d0018000 nid=0xe26b
runnable 

"VM Periodic Task Thread" prio=10 tid=0x00007f31d00a0000 nid=0xe273
waiting on condition

JNI global references: 1139



On 2013-04-29 10:26, "Jörn Kottmann" <ko...@gmail.com> wrote:

>On 04/29/2013 09:59 AM, Svetoslav Marinov wrote:
>> Below is a jstack output. It is now the third day it is running and
>>seems
>> like the process has hung up somewhere. I still haven't changed the
>> indexer to be one pass, so it is still two pass.
>>
>> I just wonder how long I should wait?
>
>Looks like it's still fetching the events from the source; the methods
>we can see in the stack dump are calculating the hash sum of the events,
>but I doubt that this is broken.
>
>Is the process at 100% CPU utilization? Is it still in the hash sum code
>if you repeat the jstack command a few times?
>
>Jörn
>



Re: Size of training data

Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/29/2013 09:59 AM, Svetoslav Marinov wrote:
> Below is a jstack output. It is now the third day it is running and seems
> like the process has hung up somewhere. I still haven't changed the
> indexer to be one pass, so it is still two pass.
>
> I just wonder how long I should wait?

Looks like it's still fetching the events from the source; the methods
we can see in the stack dump are calculating the hash sum of the events,
but I doubt that this is broken.

Is the process at 100% CPU utilization? Is it still in the hash sum code 
if you repeat the jstack command a few times?

Jörn

Re: Size of training data

Posted by Svetoslav Marinov <sv...@findwise.com>.
Hi again, 

Below is a jstack output. It is now the third day it is running and it seems
like the process has hung up somewhere. I still haven't changed the
indexer to be one pass, so it is still two pass.

I just wonder how long I should wait?

Thanks!

Svetoslav

------------------------------

Indexing events using cutoff of 6

        Computing event counts...  2013-04-26 14:37:22
Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):

"Low Memory Detector" daemon prio=10 tid=0x00007f31d009d800 nid=0xe272
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f31d009b000 nid=0xe271
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f31d0098800 nid=0xe270
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f31d008a000 nid=0xe26f
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f31d0078000 nid=0xe26e in
Object.wait() [0x00007f31ca3db000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0000000400b94808> (a
java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
        - locked <0x0000000400b94808> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)

"Reference Handler" daemon prio=10 tid=0x00007f31d0076000 nid=0xe26d in
Object.wait() [0x00007f31ca4dc000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
        - locked <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00007f31d0007800 nid=0xe267 runnable
[0x00007f31d8923000]
   java.lang.Thread.State: RUNNABLE
        at java.nio.ByteBuffer.wrap(ByteBuffer.java:367)
        at java.nio.ByteBuffer.wrap(ByteBuffer.java:390)
        at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:254)
        at java.lang.StringCoding.encode(StringCoding.java:289)
        at java.lang.String.getBytes(String.java:954)
        at opennlp.model.HashSumEventStream.next(HashSumEventStream.java:55)
        at opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:127)
        at opennlp.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:81)
        at opennlp.model.TrainUtil.train(TrainUtil.java:173)
        at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
        at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)

"VM Thread" prio=10 tid=0x00007f31d0071000 nid=0xe26c runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f31d0012800 nid=0xe268
runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f31d0014800 nid=0xe269
runnable 

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f31d0016000 nid=0xe26a
runnable 

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f31d0018000 nid=0xe26b
runnable 

"VM Periodic Task Thread" prio=10 tid=0x00007f31d00a0000 nid=0xe273
waiting on condition

JNI global references: 1139

Heap
 PSYoungGen      total 2581440K, used 2388216K [0x00000006aaab0000,
0x000000074b590000, 0x0000000800000000)
  eden space 2530304K, 94% used
[0x00000006aaab0000,0x000000073c6ee120,0x00000007451b0000)
  from space 51136K, 0% used
[0x00000007451b0000,0x00000007451b0000,0x00000007483a0000)
  to   space 48512K, 0% used
[0x0000000748630000,0x0000000748630000,0x000000074b590000)
 PSOldGen        total 167168K, used 167167K [0x0000000400000000,
0x000000040a340000, 0x00000006aaab0000)
  object space 167168K, 99% used
[0x0000000400000000,0x000000040a33fff0,0x000000040a340000)
 PSPermGen       total 21248K, used 4039K [0x00000003f5a00000,
0x00000003f6ec0000, 0x0000000400000000)
  object space 21248K, 19% used
[0x00000003f5a00000,0x00000003f5df1fe8,0x00000003f6ec0000)

2013-04-26 14:39:09
Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):


"Low Memory Detector" daemon prio=10 tid=0x00007f31d009d800 nid=0xe272
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f31d009b000 nid=0xe271
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f31d0098800 nid=0xe270
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f31d008a000 nid=0xe26f
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f31d0078000 nid=0xe26e in
Object.wait() [0x00007f31ca3db000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0000000400b94808> (a
java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
        - locked <0x0000000400b94808> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)

"Reference Handler" daemon prio=10 tid=0x00007f31d0076000 nid=0xe26d in
Object.wait() [0x00007f31ca4dc000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
        - locked <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00007f31d0007800 nid=0xe267 runnable
[0x00007f31d8923000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.copyOfRange(Arrays.java:3221)
        at java.lang.String.<init>(String.java:233)
        at java.lang.StringBuilder.toString(StringBuilder.java:447)
        at 
opennlp.tools.util.featuregen.TokenFeatureGenerator.createFeatures(TokenFea
tureGenerator.java:41)
        at 
opennlp.tools.util.featuregen.WindowFeatureGenerator.createFeatures(WindowF
eatureGenerator.java:95)
        at 
opennlp.tools.util.featuregen.AggregatedFeatureGenerator.createFeatures(Agg
regatedFeatureGenerator.java:79)
        at 
opennlp.tools.util.featuregen.CachedFeatureGenerator.createFeatures(CachedF
eatureGenerator.java:69)
        at 
opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameCo
ntextGenerator.java:118)
        at 
opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameCo
ntextGenerator.java:37)
        at 
opennlp.tools.namefind.NameFinderEventStream.generateEvents(NameFinderEvent
Stream.java:103)
        at 
opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventSt
ream.java:126)
        at 
opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventSt
ream.java:37)
        at 
opennlp.tools.util.AbstractEventStream.hasNext(AbstractEventStream.java:71)
        at 
opennlp.model.HashSumEventStream.hasNext(HashSumEventStream.java:47)
        at 
opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java
:126)
        at 
opennlp.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:81)
        at opennlp.model.TrainUtil.train(TrainUtil.java:173)
        at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
        at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)

"VM Thread" prio=10 tid=0x00007f31d0071000 nid=0xe26c runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f31d0012800 nid=0xe268
runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f31d0014800 nid=0xe269
runnable 

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f31d0016000 nid=0xe26a
runnable 

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f31d0018000 nid=0xe26b
runnable 

"VM Periodic Task Thread" prio=10 tid=0x00007f31d00a0000 nid=0xe273
waiting on condition

JNI global references: 1139

Heap
 PSYoungGen      total 2581440K, used 2267572K [0x00000006aaab0000,
0x000000074b590000, 0x0000000800000000)
  eden space 2530304K, 89% used
[0x00000006aaab0000,0x000000073511d138,0x00000007451b0000)
  from space 51136K, 0% used
[0x00000007451b0000,0x00000007451b0000,0x00000007483a0000)
  to   space 48512K, 0% used
[0x0000000748630000,0x0000000748630000,0x000000074b590000)
 PSOldGen        total 167168K, used 167167K [0x0000000400000000,
0x000000040a340000, 0x00000006aaab0000)
  object space 167168K, 99% used
[0x0000000400000000,0x000000040a33fff0,0x000000040a340000)
 PSPermGen       total 21248K, used 4039K [0x00000003f5a00000,
0x00000003f6ec0000, 0x0000000400000000)
  object space 21248K, 19% used
[0x00000003f5a00000,0x00000003f5df1fe8,0x00000003f6ec0000)



On 2013-04-26 13:41, "Jörn Kottmann" <ko...@gmail.com> wrote:

>The Two Pass Data Indexer is the default; if you have a machine with
>enough memory you might wanna try the One Pass Data Indexer.
>Anyway, it would be nice to get a jstack to see where it is spending its
>time; maybe there is an I/O issue?
>
>The training can take very long, but the data indexing should work.
>
>To change the indexer you can set this parameter:
>DataIndexer=OnePass
>
>HTH,
>Jörn
>
>On 04/26/2013 01:17 PM, Svetoslav Marinov wrote:
>> I prefer the API as it gives me more flexibility and fits the overall
>> architecture of our components. But here is part of my set-up:
>>
>> Cutoff 6
>> Iterations 200
>> CustomFeatureGenerator with looking at the 4 previous and 2 subsequent
>> tokens.
>>
>> So, I gave it a whole night and I saw the process was dead in the
>>morning.
>> But I'll give it another try and will let you know.
>>
>> Thank you!
>>
>> Svetoslav
>>
>>
>> On 2013-04-26 12:42, "Jörn Kottmann" <ko...@gmail.com> wrote:
>>
>>> I always edit the opennlp script and change it to what I need.
>>>
>>> Anyway, we have a Two Pass Data Indexer which writes the features to
>>>disk
>>> to save memory during indexing, depending on how you train you might
>>> have a cutoff=5 which eliminates probably a lot of your features and
>>> therefore
>>> saves a lot of memory.
>>>
>>> The indexing might just need a bit of time, how long did you wait?
>>>
>>> Jörn
>>>
>>> On 04/26/2013 12:33 PM, William Colen wrote:
>>>>   From command line you can specify memory using
>>>>
>>>> MAVEN_OPTS="-Xmx4048m"
>>>>
>>>> You can also set it as JVM arguments if you are using from the API:
>>>>
>>>> java -Xmx4048m ...
>>>>
>>>>
>>>>
>>>> On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov <
>>>> svetoslav.marinov@findwise.com> wrote:
>>>>
>>>>> I use the API. Can one specify the memory size via the command line?
>>>>>I
>>>>> think the default there is 1024M? At 8G memory during "computing
>>>>>event
>>>>> counts...", at 16G during indexing: "Computing event counts...  done.
>>>>> 50153300 events
>>>>>           Indexing…"
>>>>>
>>>>> Svetoslav
>>>>>
>>>>> On 2013-04-26 09:12, "Jörn Kottmann" <ko...@gmail.com> wrote:
>>>>>
>>>>>> On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
>>>>>>> I'm wondering what is the max size (if such exists) for training a
>>>>>>> NER
>>>>>>> model? I have a corpus of 2 600 000 sentences annotated with just
>>>>>>>one
>>>>>>> category, 310M in size. However, the training never finishes - 8G
>>>>>>> memory
>>>>>>> resulted in java out of memory exception, and when I increased it
>>>>>>>to
>>>>>>> 16G
>>>>>>> it just died with no error message.
>>>>>> Do you use the command line interface or the API for the training?
>>>>>> At which stage of the training did you get the out of memory
>>>>>> exception?
>>>>>> Where did it just die when you used 16G of memory (maybe do a
>>>>>>jstack)
>>>>>> ?
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>
>>>
>
>


Re: Size of training data

Posted by Jörn Kottmann <ko...@gmail.com>.
The Two Pass Data Indexer is the default; if you have a machine with enough
memory you might want to try the One Pass Data Indexer.
Anyway, it would be nice to get a jstack to see where it is spending its
time; maybe there is an I/O issue?

The training can take a very long time, but the data indexing should work.

To change the indexer you can set this parameter:
DataIndexer=OnePass
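For example, combined with the other settings discussed in this thread, a training-parameters file could look like the sketch below (a hypothetical example, not taken verbatim from this thread; the key names are the OpenNLP 1.5.x defaults and should be checked against your version). It can be passed with -params on the command line or loaded into TrainingParameters via the API:

```properties
# Hypothetical params file; values match the set-up discussed in this thread.
Algorithm=MAXENT
Iterations=200
Cutoff=6
DataIndexer=OnePass
```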

HTH,
Jörn

On 04/26/2013 01:17 PM, Svetoslav Marinov wrote:
> I prefer the API as it gives me more flexibility and fits the overall
> architecture of our components. But here is part of my set-up:
>
> Cutoff 6
> Iterations 200
> A CustomFeatureGenerator looking at the 4 previous and 2 subsequent
> tokens.
>
> So, I gave it a whole night and I saw the process was dead in the morning.
> But I'll give it another try and will let you know.
>
> Thank you!
>
> Svetoslav
>
>
> On 2013-04-26 12:42, "Jörn Kottmann" <ko...@gmail.com> wrote:
>
>> I always edit the opennlp script and change it to what I need.
>>
>> Anyway, we have a Two Pass Data Indexer which writes the features to disk
>> to save memory during indexing. Depending on how you train, you might
>> have a cutoff=5, which probably eliminates a lot of your features and
>> therefore saves a lot of memory.
>>
>> The indexing might just need a bit of time, how long did you wait?
>>
>> Jörn
>>
>> On 04/26/2013 12:33 PM, William Colen wrote:
>>>   From command line you can specify memory using
>>>
>>> MAVEN_OPTS="-Xmx4048m"
>>>
>>> You can also set it as JVM arguments if you are using from the API:
>>>
>>> java -Xmx4048m ...
>>>
>>>
>>>
>>> On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov <
>>> svetoslav.marinov@findwise.com> wrote:
>>>
>>>> I use the API. Can one specify the memory size via the command line? I
>>>> think the default there is 1024M? At 8G memory during "computing event
>>>> counts...", at 16G during indexing: "Computing event counts...  done.
>>>> 50153300 events
>>>>           Indexing…"
>>>>
>>>> Svetoslav
>>>>
>>>> On 2013-04-26 09:12, "Jörn Kottmann" <ko...@gmail.com> wrote:
>>>>
>>>>> On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
>>>>>> I'm wondering what is the max size (if such exists) for training a
>>>>>> NER
>>>>>> model? I have a corpus of 2 600 000 sentences annotated with just one
>>>>>> category, 310M in size. However, the training never finishes – 8G
>>>>>> memory
>>>>>> resulted in java out of memory exception, and when I increased it to
>>>>>> 16G
>>>>>> it just died with no error message.
>>>>> Do you use the command line interface or the API for the training?
>>>>> At which stage of the training did you get the out of memory
>>>>> exception?
>>>>> Where did it just die when you used 16G of memory (maybe do a jstack)
>>>>> ?
>>>>>
>>>>> Jörn
>>>>>
>>>>
>>


Re: Size of training data

Posted by Svetoslav Marinov <sv...@findwise.com>.
I prefer the API as it gives me more flexibility and fits the overall
architecture of our components. But here is part of my set-up:

Cutoff 6
Iterations 200
A CustomFeatureGenerator looking at the 4 previous and 2 subsequent
tokens.
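For readers hitting the same issue, a set-up like the one above can be wired up through the API roughly as below. This is a sketch against the OpenNLP 1.5.x signatures, not code from this thread; sampleStream and featureGenerator stand in for your own objects, and the constant and method names should be verified against the version you use.

```java
import java.io.IOException;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

class TrainerSketch {
    // Sketch only: sampleStream and featureGenerator come from your own set-up.
    static TokenNameFinderModel train(ObjectStream<NameSample> sampleStream,
            AdaptiveFeatureGenerator featureGenerator) throws IOException {
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, "200"); // as above
        params.put(TrainingParameters.CUTOFF_PARAM, "6");       // as above
        params.put("DataIndexer", "OnePass"); // keeps indexing in memory, one pass
        // Train a single-category name finder model.
        return NameFinderME.train("en", "default", sampleStream, params,
                featureGenerator, Collections.<String, Object>emptyMap());
    }
}
```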

So, I gave it a whole night and I saw the process was dead in the morning.
But I'll give it another try and will let you know.

Thank you!

Svetoslav


On 2013-04-26 12:42, "Jörn Kottmann" <ko...@gmail.com> wrote:

>I always edit the opennlp script and change it to what I need.
>
>Anyway, we have a Two Pass Data Indexer which writes the features to disk
>to save memory during indexing. Depending on how you train, you might
>have a cutoff=5, which probably eliminates a lot of your features and
>therefore saves a lot of memory.
>
>The indexing might just need a bit of time, how long did you wait?
>
>Jörn
>
>On 04/26/2013 12:33 PM, William Colen wrote:
>>  From command line you can specify memory using
>>
>> MAVEN_OPTS="-Xmx4048m"
>>
>> You can also set it as JVM arguments if you are using from the API:
>>
>> java -Xmx4048m ...
>>
>>
>>
>> On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov <
>> svetoslav.marinov@findwise.com> wrote:
>>
>>> I use the API. Can one specify the memory size via the command line? I
>>> think the default there is 1024M? At 8G memory during "computing event
>>> counts...", at 16G during indexing: "Computing event counts...  done.
>>> 50153300 events
>>>          Indexing…"
>>>
>>> Svetoslav
>>>
>>> On 2013-04-26 09:12, "Jörn Kottmann" <ko...@gmail.com> wrote:
>>>
>>>> On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
>>>>> I'm wondering what is the max size (if such exists) for training a
>>>>>NER
>>>>> model? I have a corpus of 2 600 000 sentences annotated with just one
>>>>> category, 310M in size. However, the training never finishes – 8G
>>>>>memory
>>>>> resulted in java out of memory exception, and when I increased it to
>>>>>16G
>>>>> it just died with no error message.
>>>> Do you use the command line interface or the API for the training?
>>>> At which stage of the training did you get the out of memory
>>>>exception?
>>>> Where did it just die when you used 16G of memory (maybe do a jstack)
>>>>?
>>>>
>>>> Jörn
>>>>
>>>
>>>
>
>


Re: Size of training data

Posted by Jörn Kottmann <ko...@gmail.com>.
I always edit the opennlp script and change it to what I need.

Anyway, we have a Two Pass Data Indexer which writes the features to disk
to save memory during indexing. Depending on how you train, you might
have a cutoff=5, which probably eliminates a lot of your features and
therefore saves a lot of memory.

The indexing might just need a bit of time, how long did you wait?

Jörn

On 04/26/2013 12:33 PM, William Colen wrote:
>  From command line you can specify memory using
>
> MAVEN_OPTS="-Xmx4048m"
>
> You can also set it as JVM arguments if you are using from the API:
>
> java -Xmx4048m ...
>
>
>
> On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov <
> svetoslav.marinov@findwise.com> wrote:
>
>> I use the API. Can one specify the memory size via the command line? I
>> think the default there is 1024M? At 8G memory during "computing event
>> counts...", at 16G during indexing: "Computing event counts...  done.
>> 50153300 events
>>          Indexing…"
>>
>> Svetoslav
>>
>> On 2013-04-26 09:12, "Jörn Kottmann" <ko...@gmail.com> wrote:
>>
>>> On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
>>>> I'm wondering what is the max size (if such exists) for training a NER
>>>> model? I have a corpus of 2 600 000 sentences annotated with just one
>>>> category, 310M in size. However, the training never finishes – 8G memory
>>>> resulted in java out of memory exception, and when I increased it to 16G
>>>> it just died with no error message.
>>> Do you use the command line interface or the API for the training?
>>> At which stage of the training did you get the out of memory exception?
>>> Where did it just die when you used 16G of memory (maybe do a jstack) ?
>>>
>>> Jörn
>>>
>>
>>


Re: Size of training data

Posted by William Colen <wi...@gmail.com>.
From the command line you can specify memory using

MAVEN_OPTS="-Xmx4048m"

You can also set it as JVM arguments if you are using from the API:

java -Xmx4048m ...



On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov <
svetoslav.marinov@findwise.com> wrote:

> I use the API. Can one specify the memory size via the command line? I
> think the default there is 1024M? At 8G memory during "computing event
> counts...", at 16G during indexing: "Computing event counts...  done.
> 50153300 events
>         Indexing…"
>
> Svetoslav
>
> On 2013-04-26 09:12, "Jörn Kottmann" <ko...@gmail.com> wrote:
>
> >On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
> >> I'm wondering what is the max size (if such exists) for training a NER
> >>model? I have a corpus of 2 600 000 sentences annotated with just one
> >>category, 310M in size. However, the training never finishes – 8G memory
> >>resulted in java out of memory exception, and when I increased it to 16G
> >>it just died with no error message.
> >
> >Do you use the command line interface or the API for the training?
> >At which stage of the training did you get the out of memory exception?
> >Where did it just die when you used 16G of memory (maybe do a jstack) ?
> >
> >Jörn
> >
>
>
>

Re: Size of training data

Posted by Svetoslav Marinov <sv...@findwise.com>.
I use the API. Can one specify the memory size via the command line? I
think the default there is 1024M? At 8G memory during "computing event
counts...", at 16G during indexing: "Computing event counts...  done.
50153300 events
        Indexing…"

Svetoslav

On 2013-04-26 09:12, "Jörn Kottmann" <ko...@gmail.com> wrote:

>On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
>> I'm wondering what is the max size (if such exists) for training a NER
>>model? I have a corpus of 2 600 000 sentences annotated with just one
>>category, 310M in size. However, the training never finishes – 8G memory
>>resulted in java out of memory exception, and when I increased it to 16G
>>it just died with no error message.
>
>Do you use the command line interface or the API for the training?
>At which stage of the training did you get the out of memory exception?
>Where did it just die when you used 16G of memory (maybe do a jstack) ?
>
>Jörn
>



Re: Size of training data

Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
> I'm wondering what is the max size (if such exists) for training a NER model? I have a corpus of 2 600 000 sentences annotated with just one category, 310M in size. However, the training never finishes – 8G memory resulted in java out of memory exception, and when I increased it to 16G it just died with no error message.

Do you use the command line interface or the API for the training?
At which stage of the training did you get the out of memory exception?
Where did it just die when you used 16G of memory (maybe do a jstack) ?

Jörn
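(For anyone unfamiliar with the jstack suggestion above: a thread dump of a running trainer can be captured roughly as follows, assuming a JDK is installed. The PID is a placeholder, not a value from this thread.)

```shell
# Find the trainer's JVM PID, then dump all thread stacks to a file.
jps -l                               # lists running JVMs with their main class
jstack <pid> > trainer-threads.txt   # replace <pid> with the PID from jps
```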