Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2015/07/23 23:54:35 UTC

Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Hi

I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
usage by IndexSchema. This Solr in particular has one collection with 64
shards (2 replicas, but 64 cores on one node). The schema has ~120 fields,
~20 of them of the same field type (text_general), and the deployment is
serving around 700 concurrent users (peak), with a thread pool limit of 1000.

Reducing the thread-pool size is something they've tried, but the load is
high and the server keeps up fine with that load, even with a thread pool of
that size.

What surprised me is that they report obscene numbers they see in the heap:
680K (!!) objects of TokenStreamComponents, each holds a buffer of 8KB
coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
thought that a TokenStreamComponents can be (and is) reused for all fields
in a document. And so even if we hold a ThreadLocal per
TokenStreamComponents, we should see 1000 of them at the most - per
Analyzer. And as I said, the analyzed fields are of type text_general, and
the rest of the fields are numeric, DV, String, Bool etc. (aka
not-analyzed).
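
To make that expectation concrete, here is a rough, self-contained sketch (not
Solr code; it assumes the Lucene 5.x+ Analyzer API where createComponents takes
only a field name, and the field names are made up) that counts how many
TokenStreamComponents a single thread ends up with under GLOBAL_REUSE_STRATEGY
vs PER_FIELD_REUSE_STRATEGY:

  import java.io.IOException;
  import java.util.concurrent.atomic.AtomicInteger;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardTokenizer;

  public class ReuseStrategyDemo {
    static final AtomicInteger CREATED = new AtomicInteger();

    // One Analyzer definition; only the ReuseStrategy passed to the constructor differs.
    static Analyzer newAnalyzer(Analyzer.ReuseStrategy strategy) {
      return new Analyzer(strategy) {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
          CREATED.incrementAndGet(); // each creation allocates its own tokenizer buffer
          return new TokenStreamComponents(new StandardTokenizer());
        }
      };
    }

    public static void main(String[] args) throws IOException {
      for (Analyzer.ReuseStrategy s : new Analyzer.ReuseStrategy[] {
          Analyzer.GLOBAL_REUSE_STRATEGY, Analyzer.PER_FIELD_REUSE_STRATEGY}) {
        CREATED.set(0);
        try (Analyzer a = newAnalyzer(s)) {
          for (String field : new String[] {"title", "body", "comments"}) {
            try (TokenStream ts = a.tokenStream(field, "some text")) {
              ts.reset();
              while (ts.incrementToken()) { /* consume */ }
              ts.end();
            }
          }
        }
        System.out.println((s == Analyzer.GLOBAL_REUSE_STRATEGY ? "GLOBAL" : "PER_FIELD")
            + " reuse -> components created on this thread: " + CREATED.get());
      }
    }
  }

On a single thread this should report 1 component for GLOBAL reuse and 3 for
PER_FIELD reuse (one per distinct field name), which is where the per-field
multiplication below comes from.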

Reviewing IndexSchema it holds two instances: SolrIndexAnalyzer (extends
DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy ==
PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the heap:

64 (cores) x 700 (threads) x 20 (fields) = 896K (more than 680K, but it could
be that fewer than 700 users were being served when the heap dump was taken).

And if each such instance holds a zzBuffer of size 8KB, this amounts to
>7GB of heap space!
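
Spelled out, the worst case those numbers imply is:

  64 cores x 700 threads x 20 analyzed fields = 896,000 cached TokenStreamComponents
  896,000 x 8 KB (StandardTokenizerImpl.zzBuffer) ~= 7.3 GB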

Per Analyzer's constructor (which takes ReuseStrategy):

  /**
   * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
   * <p>
   * NOTE: if you just want to reuse on a per-field basis, it's easier to
   * use a subclass of {@link AnalyzerWrapper} such as
   * <a href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
   * PerFieldAnalyerWrapper</a> instead.
   */

However, AnalyzerWrapper's documentation somewhat contradicts it (I think):

  /**
   * Creates a new AnalyzerWrapper with the given reuse strategy.
   * <p>If you want to wrap a single delegate Analyzer you can probably
   * reuse its strategy when instantiating this subclass:
   * {@code super(delegate.getReuseStrategy());}.
   * <p>If you choose different analyzers per field, use
   * {@link #PER_FIELD_REUSE_STRATEGY}.
   * @see #getReuseStrategy()
   */

Maybe it is correct for AW, but not for DelegatingAW?

From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
since SolrIndexAnalyzer returns different Analyzers for different fields
(per their field-type). But all fields that share the same Analyzer
instance should be safe reusing its TokenStreamComponents, since we never
process fields in parallel?

To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer instances
for different fields), but it's the only piece of the puzzle that confuses
me, since I trust whoever wrote this class to understand this stuff better
than I do ...
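
For reference, here is a minimal PerFieldAnalyzerWrapper sketch (the field
names and delegate analyzers are hypothetical, and it assumes the Lucene 5.x+
no-arg analyzer constructors):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.core.KeywordAnalyzer;
  import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;

  public class PerFieldWrapperExample {
    public static Analyzer build() {
      Map<String, Analyzer> perField = new HashMap<>();
      perField.put("title", new StandardAnalyzer());
      perField.put("body", new StandardAnalyzer());
      perField.put("id", new KeywordAnalyzer());
      // PerFieldAnalyzerWrapper passes PER_FIELD_REUSE_STRATEGY to its super
      // constructor, so each distinct field name gets its own cached
      // TokenStreamComponents per thread, even when two fields ("title" and
      // "body" here) delegate to equivalent analyzers.
      return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    }
  }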

What do you think?

Shai

RE: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Posted by Shai Erera <se...@gmail.com>.
Ok thanks Uwe, this makes sense now!

Shai
On Jul 24, 2015 9:55 AM, "Uwe Schindler" <uw...@thetaphi.de> wrote:

> In 4.10, SolrIndexAnalyzer extends DelegatingAnalyzerWrapper instead of
> AnalyzerWrapper.
>
>
>
> DelegatingAnalyzerWrapper has its own „ReuseStrategy“. The perField one
> here is used as fallback only (for incompatible configurations, e.g.
> when one of the per-field configs wraps with a filter or charfilter – but
> this does not happen in Solr for fields). See the patch:
> https://issues.apache.org/jira/secure/attachment/12654117/LUCENE-5803.patch
>
>
>
> It is very important that you use the PER_FIELD one as fallback strategy,
> because otherwise it would break stuff like AnalysisRequestHandler (because
> this one wraps). The Analyzer works per field, so any unknown delegate must
> be cached per field.
>
>
>
> The idea of LUCENE-5803 is to also delegate the “caching”. If the
> SolrAnalyzer does not wrap components, it can also delegate the caching.
> The wrapper’s reuse strategy is then unused. The delegate, FieldType’s
> Analyzer, uses TokenizerChain as Analyzer, which is GLOBAL_REUSE, so each
> FieldType caches globally, no matter how many field instances.
>
>
>
> So all is fine, it is just 4.7 where this optimization is not used.
>
>
>
> Uwe
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> *From:* Shai Erera [mailto:serera@gmail.com]
> *Sent:* Friday, July 24, 2015 8:39 AM
> *To:* dev@lucene.apache.org
> *Subject:* Re: Why do SolrIndex/QueryAnalyzers use
> PER_FIELD_REUSE_STRATEGY
>
>
>
> Thanks Shalin, but I reviewed the code in trunk, and it still passes
> PER_FIELD. I can double check but I'm pretty sure that's what I saw.
>
> Shai
>
> On Jul 24, 2015 7:59 AM, "Shalin Shekhar Mangar" <sh...@gmail.com>
> wrote:
>
> Uwe fixed this in 4.10 with LUCENE-5803. Now we use
> GLOBAL_REUSE_STRATEGY on a per-field type basis. One of my todos is to
> create field types per node instead of per core for more savings.
>
> On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera <se...@gmail.com> wrote:
> > Hi
> >
> > I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
> > usage by IndexSchema. This Solr in particular has one collection with 64
> > shards (2 replicas, but 64 cores on one node). The schema has ~120
> fields,
> > ~20 of them are of the same field type (text_general) and is serving
> around
> > 700 concurrent users (peak), with a thread pool limit of 1000.
> >
> > Reducing the thread-pool size is something they've tried, but the load is
> > high and the server keeps up fine with the load, and a thread pool that
> > size.
> >
> > What surprised me is that they report obscene numbers they see in the
> heap:
> > 680K (!!) objects of TokenStreamComponents, each holds a buffer of 8KB
> > coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
> > thought that a TokenStreamComponents can be (and is) reused for all
> fields
> > in a document. And so even if we hold a ThreadLocal per
> > TokenStreamComponents, we should see 1000 of them at the most - per
> > Analyzer. And as I said, the analyzed fields are of type text_general,
> and
> > the rest of the fields are numeric, DV, String, Bool etc. (aka
> > not-analyzed).
> >
> > Reviewing IndexSchema it holds two instances: SolrIndexAnalyzer (extends
> > DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
> > SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy ==
> > PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the
> heap:
> >
> > 64 (cores) x 700 (threads) x 20 (fields) = 940K (more than 680K, but
> could
> > be they served less than 700 users when the heap dump was taken).
> >
> > And if each such instance holds a zzBuffer of size 8KB, this amounts to
> >7GB
> > of heap space!
> >
> > Per Analyzer's constructor (which takes ReuseStrategy):
> >
> >   /**
> >    * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
> >    * <p>
> >    * NOTE: if you just want to reuse on a per-field basis, it's easier to
> >    * use a subclass of {@link AnalyzerWrapper} such as
> >    * <a
> > href="
> {@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html
> ">
> >    * PerFieldAnalyerWrapper</a> instead.
> >    */
> >
> > However, AnalyzerWrapper's documentation somewhat contradicts it (I
> think):
> >
> >   /**
> >    * Creates a new AnalyzerWrapper with the given reuse strategy.
> >    * <p>If you want to wrap a single delegate Analyzer you can probably
> >    * reuse its strategy when instantiating this subclass:
> >    * {@code super(delegate.getReuseStrategy());}.
> >    * <p>If you choose different analyzers per field, use
> >    * {@link #PER_FIELD_REUSE_STRATEGY}.
> >    * @see #getReuseStrategy()
> >    */
> >
> > Maybe it is correct for AW, but not for DelegatingAW?
> >
> > From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
> > since SolrIndexAnalyzer returns different Analyzers for different fields
> > (per their field-type). But all fields that share the same Analyzer
> instance
> > should be safe reusing its TokenStreamComponents, since we never process
> > fields in parallel?
> >
> > To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
> > PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer
> instances
> > for different fields), but it's the only piece of the puzzle that
> confuses
> > me, since I trust whoever wrote this class to understand this stuff
> better
> > than I do ...
> >
> > What do you think?
> >
> > Shai
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

RE: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Posted by Uwe Schindler <uw...@thetaphi.de>.
In 4.10, SolrIndexAnalyzer extends DelegatingAnalyzerWrapper instead of AnalyzerWrapper.

 

DelegatingAnalyzerWrapper has its own „ReuseStrategy“. The perField one here is used as fallback only (for incompatible configurations, e.g. when one of the per-field configs wraps with a filter or charfilter – but this does not happen in Solr for fields). See the patch: https://issues.apache.org/jira/secure/attachment/12654117/LUCENE-5803.patch

 

It is very important that you use the PER_FIELD one as fallback strategy, because otherwise it would break stuff like AnalysisRequestHandler (because this one wraps). The Analyzer works per field, so any unknown delegate must be cached per field. 

 

The idea of LUCENE-5803 is to also delegate the “caching”. If the SolrAnalyzer does not wrap components, it can also delegate the caching. The wrapper’s reuse strategy is then unused. The delegate, FieldType’s Analyzer, uses TokenizerChain as Analyzer, which is GLOBAL_REUSE, so each FieldType caches globally, no matter how many field instances there are.
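
As a rough illustration of that shape (the class and the field map below are
made up for the example and simplified compared to the actual SolrIndexAnalyzer
in the patch):

  import java.util.Map;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.DelegatingAnalyzerWrapper;

  // A DelegatingAnalyzerWrapper that only picks a delegate per field and never
  // wraps its components, so each delegate's own ReuseStrategy (GLOBAL_REUSE for
  // a TokenizerChain-style analyzer) decides how TokenStreamComponents are cached.
  final class PerFieldTypeDelegatingAnalyzer extends DelegatingAnalyzerWrapper {
    private final Map<String, Analyzer> analyzerByField; // e.g. built from the schema
    private final Analyzer defaultAnalyzer;

    PerFieldTypeDelegatingAnalyzer(Map<String, Analyzer> analyzerByField,
                                   Analyzer defaultAnalyzer) {
      // PER_FIELD is only the *fallback* strategy; it is used when components
      // have to be wrapped (e.g. the AnalysisRequestHandler case above).
      super(PER_FIELD_REUSE_STRATEGY);
      this.analyzerByField = analyzerByField;
      this.defaultAnalyzer = defaultAnalyzer;
    }

    @Override
    protected Analyzer getWrappedAnalyzer(String fieldName) {
      Analyzer delegate = analyzerByField.get(fieldName);
      return delegate != null ? delegate : defaultAnalyzer;
    }
  }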

 

So all is fine, it is just 4.7 where this optimization is not used.

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Friday, July 24, 2015 8:39 AM
To: dev@lucene.apache.org
Subject: Re: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

 

Thanks Shalin, but I reviewed the code in trunk, and it still passes PER_FIELD. I can double check but I'm pretty sure that's what I saw. 

Shai

On Jul 24, 2015 7:59 AM, "Shalin Shekhar Mangar" <sh...@gmail.com> wrote:

Uwe fixed this in 4.10 with LUCENE-5803. Now we use
GLOBAL_REUSE_STRATEGY on a per-field type basis. One of my todos is to
create field types per node instead of per core for more savings.

On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera <se...@gmail.com> wrote:
> Hi
>
> I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
> usage by IndexSchema. This Solr in particular has one collection with 64
> shards (2 replicas, but 64 cores on one node). The schema has ~120 fields,
> ~20 of them are of the same field type (text_general) and is serving around
> 700 concurrent users (peak), with a thread pool limit of 1000.
>
> Reducing the thread-pool size is something they've tried, but the load is
> high and the server keeps up fine with the load, and a thread pool that
> size.
>
> What surprised me is that they report obscene numbers they see in the heap:
> 680K (!!) objects of TokenStreamComponents, each holds a buffer of 8KB
> coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
> thought that a TokenStreamComponents can be (and is) reused for all fields
> in a document. And so even if we hold a ThreadLocal per
> TokenStreamComponents, we should see 1000 of them at the most - per
> Analyzer. And as I said, the analyzed fields are of type text_general, and
> the rest of the fields are numeric, DV, String, Bool etc. (aka
> not-analyzed).
>
> Reviewing IndexSchema it holds two instances: SolrIndexAnalyzer (extends
> DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
> SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy ==
> PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the heap:
>
> 64 (cores) x 700 (threads) x 20 (fields) = 940K (more than 680K, but could
> be they served less than 700 users when the heap dump was taken).
>
> And if each such instance holds a zzBuffer of size 8KB, this amounts to >7GB
> of heap space!
>
> Per Analyzer's constructor (which takes ReuseStrategy):
>
>   /**
>    * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
>    * <p>
>    * NOTE: if you just want to reuse on a per-field basis, it's easier to
>    * use a subclass of {@link AnalyzerWrapper} such as
>    * <a
> href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
>    * PerFieldAnalyerWrapper</a> instead.
>    */
>
> However, AnalyzerWrapper's documentation somewhat contradicts it (I think):
>
>   /**
>    * Creates a new AnalyzerWrapper with the given reuse strategy.
>    * <p>If you want to wrap a single delegate Analyzer you can probably
>    * reuse its strategy when instantiating this subclass:
>    * {@code super(delegate.getReuseStrategy());}.
>    * <p>If you choose different analyzers per field, use
>    * {@link #PER_FIELD_REUSE_STRATEGY}.
>    * @see #getReuseStrategy()
>    */
>
> Maybe it is correct for AW, but not for DelegatingAW?
>
> From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
> since SolrIndexAnalyzer returns different Analyzers for different fields
> (per their field-type). But all fields that share the same Analyzer instance
> should be safe reusing its TokenStreamComponents, since we never process
> fields in parallel?
>
> To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
> PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer instances
> for different fields), but it's the only piece of the puzzle that confuses
> me, since I trust whoever wrote this class to understand this stuff better
> than I do ...
>
> What do you think?
>
> Shai



--
Regards,
Shalin Shekhar Mangar.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Posted by Shai Erera <se...@gmail.com>.
What I also want to double check: after I sent the email I thought about it
some more, and I think PER_FIELD is passed as a fallback strategy, but
otherwise the wrapped analyzer's strategy is used. Maybe in 4.7, before Uwe
fixed things, some Analyzers still returned PER_FIELD, but now they don't
anymore. I will double check that too.

Shai
On Jul 24, 2015 9:39 AM, "Shai Erera" <se...@gmail.com> wrote:

> Thanks Shalin, but I reviewed the code in trunk, and it still passes
> PER_FIELD. I can double check but I'm pretty sure that's what I saw.
>
> Shai
> On Jul 24, 2015 7:59 AM, "Shalin Shekhar Mangar" <sh...@gmail.com>
> wrote:
>
>> Uwe fixed this in 4.10 with LUCENE-5803. Now we use
>> GLOBAL_REUSE_STRATEGY on a per-field type basis. One of my todos is to
>> create field types per node instead of per core for more savings.
>>
>> On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera <se...@gmail.com> wrote:
>> > Hi
>> >
>> > I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
>> > usage by IndexSchema. This Solr in particular has one collection with 64
>> > shards (2 replicas, but 64 cores on one node). The schema has ~120
>> fields,
>> > ~20 of them are of the same field type (text_general) and is serving
>> around
>> > 700 concurrent users (peak), with a thread pool limit of 1000.
>> >
>> > Reducing the thread-pool size is something they've tried, but the load
>> is
>> > high and the server keeps up fine with the load, and a thread pool that
>> > size.
>> >
>> > What surprised me is that they report obscene numbers they see in the
>> heap:
>> > 680K (!!) objects of TokenStreamComponents, each holds a buffer of 8KB
>> > coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
>> > thought that a TokenStreamComponents can be (and is) reused for all
>> fields
>> > in a document. And so even if we hold a ThreadLocal per
>> > TokenStreamComponents, we should see 1000 of them at the most - per
>> > Analyzer. And as I said, the analyzed fields are of type text_general,
>> and
>> > the rest of the fields are numeric, DV, String, Bool etc. (aka
>> > not-analyzed).
>> >
>> > Reviewing IndexSchema it holds two instances: SolrIndexAnalyzer (extends
>> > DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
>> > SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy
>> ==
>> > PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the
>> heap:
>> >
>> > 64 (cores) x 700 (threads) x 20 (fields) = 940K (more than 680K, but
>> could
>> > be they served less than 700 users when the heap dump was taken).
>> >
>> > And if each such instance holds a zzBuffer of size 8KB, this amounts to
>> >7GB
>> > of heap space!
>> >
>> > Per Analyzer's constructor (which takes ReuseStrategy):
>> >
>> >   /**
>> >    * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
>> >    * <p>
>> >    * NOTE: if you just want to reuse on a per-field basis, it's easier
>> to
>> >    * use a subclass of {@link AnalyzerWrapper} such as
>> >    * <a
>> >
>> href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
>> >    * PerFieldAnalyerWrapper</a> instead.
>> >    */
>> >
>> > However, AnalyzerWrapper's documentation somewhat contradicts it (I
>> think):
>> >
>> >   /**
>> >    * Creates a new AnalyzerWrapper with the given reuse strategy.
>> >    * <p>If you want to wrap a single delegate Analyzer you can probably
>> >    * reuse its strategy when instantiating this subclass:
>> >    * {@code super(delegate.getReuseStrategy());}.
>> >    * <p>If you choose different analyzers per field, use
>> >    * {@link #PER_FIELD_REUSE_STRATEGY}.
>> >    * @see #getReuseStrategy()
>> >    */
>> >
>> > Maybe it is correct for AW, but not for DelegatingAW?
>> >
>> > From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
>> > since SolrIndexAnalyzer returns different Analyzers for different fields
>> > (per their field-type). But all fields that share the same Analyzer
>> instance
>> > should be safe reusing its TokenStreamComponents, since we never process
>> > fields in parallel?
>> >
>> > To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
>> > PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer
>> instances
>> > for different fields), but it's the only piece of the puzzle that
>> confuses
>> > me, since I trust whoever wrote this class to understand this stuff
>> better
>> > than I do ...
>> >
>> > What do you think?
>> >
>> > Shai
>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

Re: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Posted by Shai Erera <se...@gmail.com>.
Thanks Shalin, but I reviewed the code in trunk, and it still passes
PER_FIELD. I can double check but I'm pretty sure that's what I saw.

Shai
On Jul 24, 2015 7:59 AM, "Shalin Shekhar Mangar" <sh...@gmail.com>
wrote:

> Uwe fixed this in 4.10 with LUCENE-5803. Now we use
> GLOBAL_REUSE_STRATEGY on a per-field type basis. One of my todos is to
> create field types per node instead of per core for more savings.
>
> On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera <se...@gmail.com> wrote:
> > Hi
> >
> > I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
> > usage by IndexSchema. This Solr in particular has one collection with 64
> > shards (2 replicas, but 64 cores on one node). The schema has ~120
> fields,
> > ~20 of them are of the same field type (text_general) and is serving
> around
> > 700 concurrent users (peak), with a thread pool limit of 1000.
> >
> > Reducing the thread-pool size is something they've tried, but the load is
> > high and the server keeps up fine with the load, and a thread pool that
> > size.
> >
> > What surprised me is that they report obscene numbers they see in the
> heap:
> > 680K (!!) objects of TokenStreamComponents, each holds a buffer of 8KB
> > coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
> > thought that a TokenStreamComponents can be (and is) reused for all
> fields
> > in a document. And so even if we hold a ThreadLocal per
> > TokenStreamComponents, we should see 1000 of them at the most - per
> > Analyzer. And as I said, the analyzed fields are of type text_general,
> and
> > the rest of the fields are numeric, DV, String, Bool etc. (aka
> > not-analyzed).
> >
> > Reviewing IndexSchema it holds two instances: SolrIndexAnalyzer (extends
> > DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
> > SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy ==
> > PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the
> heap:
> >
> > 64 (cores) x 700 (threads) x 20 (fields) = 940K (more than 680K, but
> could
> > be they served less than 700 users when the heap dump was taken).
> >
> > And if each such instance holds a zzBuffer of size 8KB, this amounts to
> >7GB
> > of heap space!
> >
> > Per Analyzer's constructor (which takes ReuseStrategy):
> >
> >   /**
> >    * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
> >    * <p>
> >    * NOTE: if you just want to reuse on a per-field basis, it's easier to
> >    * use a subclass of {@link AnalyzerWrapper} such as
> >    * <a
> >
> href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
> >    * PerFieldAnalyerWrapper</a> instead.
> >    */
> >
> > However, AnalyzerWrapper's documentation somewhat contradicts it (I
> think):
> >
> >   /**
> >    * Creates a new AnalyzerWrapper with the given reuse strategy.
> >    * <p>If you want to wrap a single delegate Analyzer you can probably
> >    * reuse its strategy when instantiating this subclass:
> >    * {@code super(delegate.getReuseStrategy());}.
> >    * <p>If you choose different analyzers per field, use
> >    * {@link #PER_FIELD_REUSE_STRATEGY}.
> >    * @see #getReuseStrategy()
> >    */
> >
> > Maybe it is correct for AW, but not for DelegatingAW?
> >
> > From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
> > since SolrIndexAnalyzer returns different Analyzers for different fields
> > (per their field-type). But all fields that share the same Analyzer
> instance
> > should be safe reusing its TokenStreamComponents, since we never process
> > fields in parallel?
> >
> > To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
> > PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer
> instances
> > for different fields), but it's the only piece of the puzzle that
> confuses
> > me, since I trust whoever wrote this class to understand this stuff
> better
> > than I do ...
> >
> > What do you think?
> >
> > Shai
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Uwe fixed this in 4.10 with LUCENE-5803. Now we use
GLOBAL_REUSE_STRATEGY on a per-field type basis. One of my todos is to
create field types per node instead of per core for more savings.

On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera <se...@gmail.com> wrote:
> Hi
>
> I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
> usage by IndexSchema. This Solr in particular has one collection with 64
> shards (2 replicas, but 64 cores on one node). The schema has ~120 fields,
> ~20 of them are of the same field type (text_general) and is serving around
> 700 concurrent users (peak), with a thread pool limit of 1000.
>
> Reducing the thread-pool size is something they've tried, but the load is
> high and the server keeps up fine with the load, and a thread pool that
> size.
>
> What surprised me is that they report obscene numbers they see in the heap:
> 680K (!!) objects of TokenStreamComponents, each holds a buffer of 8KB
> coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
> thought that a TokenStreamComponents can be (and is) reused for all fields
> in a document. And so even if we hold a ThreadLocal per
> TokenStreamComponents, we should see 1000 of them at the most - per
> Analyzer. And as I said, the analyzed fields are of type text_general, and
> the rest of the fields are numeric, DV, String, Bool etc. (aka
> not-analyzed).
>
> Reviewing IndexSchema it holds two instances: SolrIndexAnalyzer (extends
> DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
> SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy ==
> PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the heap:
>
> 64 (cores) x 700 (threads) x 20 (fields) = 940K (more than 680K, but could
> be they served less than 700 users when the heap dump was taken).
>
> And if each such instance holds a zzBuffer of size 8KB, this amounts to >7GB
> of heap space!
>
> Per Analyzer's constructor (which takes ReuseStrategy):
>
>   /**
>    * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
>    * <p>
>    * NOTE: if you just want to reuse on a per-field basis, it's easier to
>    * use a subclass of {@link AnalyzerWrapper} such as
>    * <a
> href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
>    * PerFieldAnalyerWrapper</a> instead.
>    */
>
> However, AnalyzerWrapper's documentation somewhat contradicts it (I think):
>
>   /**
>    * Creates a new AnalyzerWrapper with the given reuse strategy.
>    * <p>If you want to wrap a single delegate Analyzer you can probably
>    * reuse its strategy when instantiating this subclass:
>    * {@code super(delegate.getReuseStrategy());}.
>    * <p>If you choose different analyzers per field, use
>    * {@link #PER_FIELD_REUSE_STRATEGY}.
>    * @see #getReuseStrategy()
>    */
>
> Maybe it is correct for AW, but not for DelegatingAW?
>
> From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
> since SolrIndexAnalyzer returns different Analyzers for different fields
> (per their field-type). But all fields that share the same Analyzer instance
> should be safe reusing its TokenStreamComponents, since we never process
> fields in parallel?
>
> To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
> PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer instances
> for different fields), but it's the only piece of the puzzle that confuses
> me, since I trust whoever wrote this class to understand this stuff better
> than I do ...
>
> What do you think?
>
> Shai



-- 
Regards,
Shalin Shekhar Mangar.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org