Posted to dev@opennlp.apache.org by Joern Kottmann <ko...@gmail.com> on 2017/02/08 10:50:42 UTC

Multiple models and String.intern

Hello all,

I often run multiple models in production, often trained on the same data
but with different types (typical name finder scenario). There could be one
model to detect person names, and another to detect locations. The
predicate Strings inside those models are always the same but the models
can't share the same String instance.

I would like to propose that we use String.intern in the model reader to
ensure one string is only loaded once.
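
A minimal sketch of the idea, not OpenNLP's actual model reader: the class name, the toy serialization format, and the predicate strings below are all made up for illustration. It reads the same predicate list twice, as two models trained on the same data would, and shows that the resulting Strings are only shared when the reader interns them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical reader used only to illustrate the proposal.
class InterningReaderSketch {

  // Writes predicate names in a toy format: a count followed by UTF strings.
  static byte[] writePredicates(String... predicates) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(predicates.length);
      for (String p : predicates) {
        out.writeUTF(p);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Reads the predicates back; readUTF allocates a fresh String per call,
  // so instances are only shared across reads when intern is true.
  static String[] readPredicates(byte[] model, boolean intern) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(model))) {
      String[] predicates = new String[in.readInt()];
      for (int i = 0; i < predicates.length; i++) {
        String p = in.readUTF();
        predicates[i] = intern ? p.intern() : p;
      }
      return predicates;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] model = writePredicates("w=john", "prev=mr", "suffix=on");

    // Two "models" loaded with interning share one instance per predicate.
    String[] person = readPredicates(model, true);
    String[] location = readPredicates(model, true);
    System.out.println(person[0] == location[0]); // true

    // Without interning, each load holds its own copy.
    String[] rawPerson = readPredicates(model, false);
    String[] rawLocation = readPredicates(model, false);
    System.out.println(rawPerson[0] == rawLocation[0]); // false
  }
}
```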

We tried that in the past and it caused lots of issues with PermGen
space, but this has improved over time: since Java 7 the string pool lives
on the heap rather than in PermGen. In Java 8 (on which we depend now)
this should work properly.

Here is an interesting article about it:
http://java-performance.info/string-intern-in-java-6-7-8/

Using String.intern will make the model loading a bit slower (we can
benchmark that).
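
A rough way to get a first impression of that cost, assuming synthetic predicate names rather than a real model file. The class and method names are made up, and System.nanoTime timing like this is only indicative; a proper harness such as JMH would be needed for numbers worth quoting.

```java
// Hypothetical timing sketch; it does not touch any OpenNLP classes.
class InternLoadTimingSketch {

  // Builds raw character data standing in for predicate names in a model file.
  static char[][] makeRawPredicates(int n) {
    char[][] raw = new char[n][];
    for (int i = 0; i < n; i++) {
      raw[i] = ("feature=" + i).toCharArray();
    }
    return raw;
  }

  // Simulates loading: allocate a String per predicate, optionally intern it.
  static String[] load(char[][] raw, boolean intern) {
    String[] loaded = new String[raw.length];
    for (int i = 0; i < raw.length; i++) {
      String s = new String(raw[i]);
      loaded[i] = intern ? s.intern() : s;
    }
    return loaded;
  }

  // Times a single simulated load in nanoseconds.
  static long timeLoadNanos(char[][] raw, boolean intern) {
    long start = System.nanoTime();
    String[] loaded = load(raw, intern);
    long elapsed = System.nanoTime() - start;
    if (loaded.length != raw.length) {
      throw new IllegalStateException("unexpected result size");
    }
    return elapsed;
  }

  public static void main(String[] args) {
    char[][] raw = makeRawPredicates(500_000);

    // Warm up both paths so the JIT has compiled them before measuring.
    for (int i = 0; i < 5; i++) {
      timeLoadNanos(raw, false);
      timeLoadNanos(raw, true);
    }

    System.out.println("plain load  (ms): " + timeLoadNanos(raw, false) / 1_000_000);
    System.out.println("intern load (ms): " + timeLoadNanos(raw, true) / 1_000_000);
  }
}
```

Note that after the warm-up the strings are already in the pool, so the measured intern run reflects loading a second model with the same predicates, which is the scenario described above.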

Jörn

Re: Multiple models and String.intern

Posted by Jeffrey Zemerick <jz...@apache.org>.
I did not know that about StringTableSize. I thought it was more of a hard
limit. That's good to know. Thanks.

On Wed, Feb 8, 2017 at 2:16 PM, Joern Kottmann <ko...@gmail.com> wrote:

> The StringTableSize doesn't limit the number of Strings that can be stored
> in the pool; if the table is too small, lookups just get slower.
> This would only be done for loading models, querying the model wouldn't be
> affected. The predicate / feature strings would be interned.
>
> Jörn

Re: Multiple models and String.intern

Posted by Joern Kottmann <ko...@gmail.com>.
The StringTableSize doesn't limit the number of Strings that can be stored
in the pool; if the table is too small, lookups just get slower.
This would only be done for loading models, querying the model wouldn't be
affected. The predicate / feature strings would be interned.
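
A small sketch of that point, assuming a HotSpot JVM: it prints the configured table size and then interns more distinct strings than the usual default table size, checking that every one of them is still found in the pool. The class name and the "pred=" strings are made up; HotSpotDiagnosticMXBean is HotSpot-specific, so the lookup falls back to "unknown" elsewhere.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical demo class; not part of OpenNLP.
class StringTableSketch {

  // Reads the -XX:StringTableSize value, or "unknown" if the JVM does not expose it.
  static String stringTableSize() {
    try {
      HotSpotDiagnosticMXBean bean =
          ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
      return bean.getVMOption("StringTableSize").getValue();
    } catch (RuntimeException | LinkageError e) {
      return "unknown";
    }
  }

  // Interns n distinct strings and keeps them reachable so they stay pooled.
  static String[] internMany(int n) {
    String[] held = new String[n];
    for (int i = 0; i < n; i++) {
      held[i] = ("pred=" + i).intern();
    }
    return held;
  }

  // Checks that a fresh equal String resolves to the pooled instance for every entry.
  static boolean allPooled(String[] held) {
    for (String s : held) {
      if (new String(s).intern() != s) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println("StringTableSize = " + stringTableSize());
    String[] held = internMany(200_000);
    System.out.println("all " + held.length + " strings pooled: " + allPooled(held));
  }
}
```

Running it with different values, for example java -XX:StringTableSize=1009 StringTableSketch.java, should change how long the interning takes rather than whether it succeeds.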

Jörn



On Wed, Feb 8, 2017 at 6:37 PM, Jeffrey Zemerick <jz...@apache.org>
wrote:

> Would it be possible to have an option or setting somewhere that determines
> if string pooling is used? The option would provide backward compatibility
> in case someone has to adjust the -XX:StringTableSize because their
> existing models exceed the default JVM limit, and an option would also be
> useful for cases when the models were made from different data sources.
> (I'm assuming in that case using string pooling would be detrimental to
> performance.)
>
> Jeff

Re: Multiple models and String.intern

Posted by Jeffrey Zemerick <jz...@apache.org>.
Would it be possible to have an option or setting somewhere that determines
if string pooling is used? The option would provide backward compatibility
in case someone has to adjust the -XX:StringTableSize because their
existing models exceed the default JVM limit, and an option would also be
useful for cases when the models were made from different data sources.
(I'm assuming in that case using string pooling would be detrimental to
performance.)
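
One way such a switch could look, as a sketch only: the system property name below is hypothetical and does not exist in OpenNLP, and defaulting it to "true" is an assumption made for the example.

```java
// Hypothetical toggle; neither the class nor the property is part of OpenNLP.
class PoolingOptionSketch {

  // Made-up property name, used here purely for illustration.
  static final String PROPERTY = "example.model.internStrings";

  // Pooling is on unless the property is explicitly set to something other than "true".
  static boolean poolingEnabled() {
    return Boolean.parseBoolean(System.getProperty(PROPERTY, "true"));
  }

  // Returns the pooled instance when pooling is enabled, else the String unchanged.
  static String canonicalize(String s) {
    return poolingEnabled() ? s.intern() : s;
  }

  public static void main(String[] args) {
    System.setProperty(PROPERTY, "false");
    String own = new String("pred=x");
    System.out.println(canonicalize(own) == own); // true: returned untouched

    System.setProperty(PROPERTY, "true");
    String first = canonicalize(new String("pred=y"));
    String second = canonicalize(new String("pred=y"));
    System.out.println(first == second); // true: both resolve to the pooled instance
  }
}
```

A model reader would call canonicalize on each predicate as it is read, so users could opt out with -Dexample.model.internStrings=false without any code change.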

Jeff


On Wed, Feb 8, 2017 at 5:50 AM, Joern Kottmann <ko...@gmail.com> wrote:

> Hello all,
>
> I often run multiple models in production, often trained on the same data
> but with different types (typical name finder scenario). There could be one
> model to detect person names, and another to detect locations. The
> predicate Strings inside those models are always the same but the models
> can't share the same String instance.
>
> I would like to propose that we use String.intern in the model reader to
> ensure one string is only loaded once.
>
> We tried that in the past and it caused lots of issues with PermGen
> space, but this has improved over time: since Java 7 the string pool lives
> on the heap rather than in PermGen. In Java 8 (on which we depend now)
> this should work properly.
>
> Here is an interesting article about it:
> http://java-performance.info/string-intern-in-java-6-7-8/
>
> Using String.intern will make the model loading a bit slower (we can
> benchmark that).
>
> Jörn
>