
Posted to solr-user@lucene.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/11/16 15:56:30 UTC

Reduce QueryComponent prepare time

Hi,

We're seeing high prepare times for the QueryComponent, obviously due to the vast number of fields and queries. It's common to have a prepare time of 70-80ms, while the process times drop significantly thanks to warmed searchers, OS cache etc. The prepare time is a recurring issue, and I hope there are people here who can share some thoughts or hints.

We're using a recent checkout on a 10-node test cluster with SSDs (although this is not an IO issue) and edismax on about a hundred different fields. This includes phrase searches over most of those fields and SpanFirst queries on about 25 fields. We'd like to see how we can avoid doing the same prepare procedure over and over again ;)

Thanks,
Markus

RE: Reduce QueryComponent prepare time

Posted by Markus Jelsma <ma...@openindex.io>.

Hi Mikhail,

Thanks for sharing your experiences. I'll look into the flexible query parser.

Markus
 
 
-----Original message-----
> From:Mikhail Khludnev <mk...@griddynamics.com>
> Sent: Tue 20-Nov-2012 19:53
> To: solr-user@lucene.apache.org
> Subject: Re: Reduce QueryComponent prepare time
> 

Re: Reduce QueryComponent prepare time

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Markus,

It seems you've taken on the challenge of optimizing the complex eDisMax code
for your particular use case, which is not so common. I can't help with that
coding, but I can share some experience: we have mind-blowing queries too -
they span many fields and enumerate many phrase shingles. We have a similar
counterintuitive hot spot - query parsing takes more time than searching and
faceting. But in our case dictionary lookups, i.e. term substitutions
and transformations, are the main CPU consumers. We built our own query
parser with something like
http://lucene.apache.org/core/4_0_0-ALPHA/queryparser/org/apache/lucene/queryparser/flexible/core/package-summary.html.
This approach, where you represent the core query structure as a skeleton of
DOM-like nodes and then transform them into particular query instances, *might
be more performant* (and *might not be* for you) than the current eDismax.
Nothing more useful from me.
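
To make the "skeleton of nodes" idea above concrete, here is a minimal sketch in plain Java. It does not use the Lucene flexible query parser API; the Node, Term, Phrase and AnyOf types and the parse/expand methods are hypothetical names invented for illustration, and the rendered strings merely stand in for real query objects. The point is only that the user input is analyzed once into a field-independent tree, which is then instantiated per field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: parse once into a field-free skeleton, then expand per field. */
final class QuerySkeleton {
  /** A node knows how to render itself for one concrete field. */
  interface Node {
    String render(String field);
  }

  static final class Term implements Node {
    final String text;
    Term(String text) { this.text = text; }
    public String render(String field) { return field + ":" + text; }
  }

  static final class Phrase implements Node {
    final List<String> words;
    Phrase(List<String> words) { this.words = words; }
    public String render(String field) {
      return field + ":\"" + String.join(" ", words) + "\"";
    }
  }

  static final class AnyOf implements Node {
    final List<Node> children;
    AnyOf(List<Node> children) { this.children = children; }
    public String render(String field) {
      List<String> parts = new ArrayList<>();
      for (Node child : children) {
        parts.add(child.render(field));
      }
      return "(" + String.join(" OR ", parts) + ")";
    }
  }

  /** Builds the skeleton once: one term per word, plus the full phrase if there are 2+ words. */
  static Node parse(String userQuery) {
    String[] words = userQuery.trim().split("\\s+");
    List<Node> nodes = new ArrayList<>();
    for (String word : words) {
      nodes.add(new Term(word));
    }
    if (words.length > 1) {
      nodes.add(new Phrase(Arrays.asList(words)));
    }
    return new AnyOf(nodes);
  }

  /** Instantiates the same skeleton for every field without re-parsing the input. */
  static List<String> expand(Node skeleton, List<String> fields) {
    List<String> out = new ArrayList<>();
    for (String field : fields) {
      out.add(skeleton.render(field));
    }
    return out;
  }

  public static void main(String[] args) {
    Node skeleton = parse("solr query");
    for (String q : expand(skeleton, Arrays.asList("title", "body"))) {
      System.out.println(q);
    }
  }
}
```

Whether this is faster than eDismax for a given setup would have to be measured; the sketch only shows the shape of the approach.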

Bye.


On Tue, Nov 20, 2012 at 7:01 PM, Markus Jelsma
<ma...@openindex.io>wrote:




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

RE: Reduce QueryComponent prepare time

Posted by Markus Jelsma <ma...@openindex.io>.

Hi,

Profiling pointed me directly to the method I already suspected: ExtendedDismaxQParser.parse(). I added manual timers to parts of the method and made sure the timers add up to the QueryComponent prepare time. After starting Solr, one small part takes almost 100ms on a fast machine with lots of memory; fortunately this happens only once. KStemmer, the loading of the KStemData and the ThaiWordFilter's init take the bulk of it.

      ExtendedSolrQueryParser up =
        new ExtendedSolrQueryParser(this, IMPOSSIBLE_FIELD_NAME);
      up.addAlias(IMPOSSIBLE_FIELD_NAME,
                tiebreaker, queryFields);
      addAliasesFromRequest(up, tiebreaker);
      up.setPhraseSlop(qslop);     // slop for explicit user phrase queries
      up.setAllowLeadingWildcard(true);
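
For readers who want to repeat the "manual timers" measurement, a minimal sketch of one way to do it follows. SectionTimer is a hypothetical helper, not part of Solr, and busyWork() only stands in for the real sections of parse(). The idea is to charge the elapsed time between consecutive marks to a named section and check that the sections sum to the reported prepare time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: accumulates elapsed wall-clock time per named section. */
final class SectionTimer {
  private final Map<String, Long> nanosBySection = new LinkedHashMap<>();
  private long lastMark = System.nanoTime();

  /** Charges the time elapsed since the previous mark (or construction) to a section. */
  void mark(String section) {
    long now = System.nanoTime();
    nanosBySection.merge(section, now - lastMark, Long::sum);
    lastMark = now;
  }

  /** Sum over all sections; compare this against the reported prepare time. */
  long totalNanos() {
    long total = 0L;
    for (long nanos : nanosBySection.values()) {
      total += nanos;
    }
    return total;
  }

  Map<String, Long> sections() {
    return nanosBySection;
  }

  // Stand-in for real work such as parser setup or query parsing.
  private static int busyWork(int rounds) {
    int acc = 0;
    for (int i = 0; i < rounds; i++) {
      acc += Integer.toString(i).hashCode();
    }
    return acc;
  }

  public static void main(String[] args) {
    SectionTimer timer = new SectionTimer();
    int sink = busyWork(50_000);
    timer.mark("parserSetup");
    sink += busyWork(200_000);
    timer.mark("parseUserQuery");
    sink += busyWork(100_000);
    timer.mark("phraseFields");
    timer.sections().forEach((name, nanos) ->
        System.out.printf("%s: %.3f ms%n", name, nanos / 1e6));
    System.out.printf("total: %.3f ms (sink=%d)%n", timer.totalNanos() / 1e6, sink);
  }
}
```

Because each section is measured from the previous mark, nothing between the first and last mark goes unaccounted for, which is what makes the "timers add up" check possible.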

After it's been running for some time, two parts continue to take a lot of time: parsing the query

      if (parsedUserQuery == null) {
        sb = new StringBuilder();
        for (Clause clause : clauses) {

        ....

        if (parsedUserQuery instanceof BooleanQuery) {
          BooleanQuery t = new BooleanQuery();
          SolrPluginUtils.flattenBooleanQuery(t, (BooleanQuery)parsedUserQuery);
          SolrPluginUtils.setMinShouldMatch(t, minShouldMatch);
          parsedUserQuery = t;
        }
      }

and handling the phrase fields (pf, pf2, pf3):

      if (allPhraseFields.size() > 0) {
        // full phrase and shingles
        for (FieldParams phraseField: allPhraseFields) {
          Map<String,Float> pf = new HashMap<String,Float>(1);
          pf.put(phraseField.getField(),phraseField.getBoost());
          addShingledPhraseQueries(query, normalClauses, pf,
              phraseField.getWordGrams(), tiebreaker, phraseField.getSlop());
        }
      }

The problem is significant when there are a lot of fields: the prepare time is usually higher than the process times of query, highlight and facet combined.
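
A rough, illustrative way to see why the phrase-field loop above scales badly with the number of fields is to count the shingles built per field. The sketch below is not Solr code: ShingleCount and its methods are hypothetical, and it assumes that a word-gram size of 0 means "the whole query as one phrase" (pf), 2 means bigrams (pf2) and 3 means trigrams (pf3).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: counts the phrase shingles built per phrase field. */
final class ShingleCount {
  /**
   * Returns the word shingles for one phrase field.
   * Assumption: wordGrams == 0 means the full phrase; otherwise a sliding window of that size.
   */
  static List<String> shingles(List<String> words, int wordGrams) {
    List<String> out = new ArrayList<>();
    int size = (wordGrams == 0) ? words.size() : wordGrams;
    if (size < 2 || size > words.size()) {
      return out; // no phrase can be built from fewer than two words
    }
    for (int i = 0; i + size <= words.size(); i++) {
      out.add(String.join(" ", words.subList(i, i + size)));
    }
    return out;
  }

  /** Total number of phrases across all phrase fields, one word-gram size per field. */
  static int phraseCount(List<String> words, int[] wordGramsPerField) {
    int total = 0;
    for (int wordGrams : wordGramsPerField) {
      total += shingles(words, wordGrams).size();
    }
    return total;
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("reduce", "query", "prepare", "time");
    // One pf, one pf2 and one pf3 entry for each of 100 fields (illustrative numbers).
    int[] perField = new int[300];
    for (int i = 0; i < 100; i++) {
      perField[i] = 0;
      perField[100 + i] = 2;
      perField[200 + i] = 3;
    }
    System.out.println("bigrams: " + shingles(words, 2));
    System.out.println("phrases to build: " + phraseCount(words, perField));
  }
}
```

Each of those phrases still has to be parsed and analyzed for its field on every request, so the work grows with both the query length and the number of phrase fields.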


 
-----Original message-----
> From:Mikhail Khludnev <mk...@griddynamics.com>
> Sent: Mon 19-Nov-2012 12:52
> To: solr-user@lucene.apache.org
> Subject: Re: Reduce QueryComponent prepare time
> 

Re: Reduce QueryComponent prepare time

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Markus,

It's hard to suggest anything until you provide a profiler snapshot that
shows where the time in prepare is spent. As far as I know, prepare is where
queries are parsed; e.g. we have really heavy query parsers, but I don't think
that's really common.


On Mon, Nov 19, 2012 at 3:08 PM, Markus Jelsma
<ma...@openindex.io>wrote:




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

RE: Reduce QueryComponent prepare time

Posted by Markus Jelsma <ma...@openindex.io>.

I'd also like to know which parts of the entire query constitute the prepare time, and whether it would make a significant difference if we extended the edismax plugin and hardcoded the parameters we pass into (reusable) objects.
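
As an illustration of what "reusable objects" could mean here, the sketch below parses a field/boost parameter string (in the style of "title^2.0 body") once and hands out the same immutable map on later requests. FieldBoostCache is a hypothetical class, not an existing Solr extension point, and whether this part of the prepare phase is expensive enough to matter would need to be measured first.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: parse a "field^boost ..." parameter once and reuse the result. */
final class FieldBoostCache {
  private final Map<String, Map<String, Float>> cache = new ConcurrentHashMap<>();
  private final AtomicInteger parseCount = new AtomicInteger();

  /** Returns the parsed boosts, parsing only the first time a given string is seen. */
  Map<String, Float> get(String rawParam) {
    return cache.computeIfAbsent(rawParam, this::parse);
  }

  /** Number of times a parameter string actually had to be parsed. */
  int parseCount() {
    return parseCount.get();
  }

  private Map<String, Float> parse(String rawParam) {
    parseCount.incrementAndGet();
    Map<String, Float> boosts = new LinkedHashMap<>();
    for (String token : rawParam.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue;
      }
      int caret = token.indexOf('^');
      if (caret < 0) {
        boosts.put(token, 1.0f); // no explicit boost given
      } else {
        boosts.put(token.substring(0, caret), Float.parseFloat(token.substring(caret + 1)));
      }
    }
    // Immutable, so the same instance can be shared safely between requests.
    return Collections.unmodifiableMap(boosts);
  }

  public static void main(String[] args) {
    FieldBoostCache cache = new FieldBoostCache();
    System.out.println(cache.get("title^2.0 body"));
    System.out.println(cache.get("title^2.0 body"));
    System.out.println("parsed " + cache.parseCount() + " time(s)");
  }
}
```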

Thanks,
Markus
 
-----Original message-----
> From:Markus Jelsma <ma...@openindex.io>
> Sent: Fri 16-Nov-2012 15:57
> To: solr-user@lucene.apache.org
> Subject: Reduce QueryComponent prepare time
> 