Posted to users@solr.apache.org by Ashwin Ramesh <as...@canva.com.INVALID> on 2022/01/12 21:01:19 UTC

"Slow" Query performance with boosts.

Hi everyone,

I have a few questions about how we can improve our solr query performance,
especially for boosts (BF, BQ, boost, etc).

*System Specs:*
Solr Version: 7.7.x
Heap Size: 31 GB
Num Docs: >100M
Shards: 8
Replication Factor: 6
Index is completely mapped into memory


Example query:
{
q=hello world
qf=title description keywords
pf=title^0.5
ps=0
fq=type:P
boost:def(boostFieldA,1) // boostFieldA is docValue float type
bf=mul(termfreq(termScoreFieldB,$q),1000.0) // termScoreFieldB is a
textField. No docValue, just indexed
rows:500
fl=id,score
}

numFound: >21M
qTime: 800ms
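For reference, the shorthand above, expressed as explicit HTTP query parameters, might look like the following. This is a sketch: the host, the collection name, and `defType=edismax` are assumptions, since the post only lists the parameters.

```python
from urllib.parse import urlencode

# Hypothetical host/collection; parameter values are taken from the post above.
params = {
    "q": "hello world",
    "defType": "edismax",  # assumed: qf/pf/ps/bf/boost imply the (e)dismax parser
    "qf": "title description keywords",
    "pf": "title^0.5",
    "ps": "0",
    "fq": "type:P",
    "boost": "def(boostFieldA,1)",
    "bf": "mul(termfreq(termScoreFieldB,$q),1000.0)",
    "rows": "500",
    "fl": "id,score",
}
url = "http://localhost:8983/solr/mycollection/select?" + urlencode(params)
print(url)
```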

Experimenting with the params:

   - When I remove the boost parameter, the qTime drops to 525ms
   - When I remove the bf parameter, the qTime drops to 650ms
   - When I remove both the boost & bf parameters, the qTime drops to 400ms


Questions:

   1. Is there any way to improve the performance of the boosts (specific
   field types, etc)?
   2. Will sharding further such that each core only has to score a smaller
   subset of documents help with query performance?
   3. Is there any performance impact when boosting/querying against sparse
   fields, both indexed=true or docValues=true?
   4. It seems the base-case scoring takes 400ms, which is already quite
   high. Is this because the query (hello world) implicitly gets parsed as
   (hello OR world), which would be more computationally expensive?
   5. Any other advice :) ?


Thanks in advance,

Ash

-- 
<https://www.canva.com/> Empowering the world to design
Share accurate information on COVID-19 and spread messages of support to your community. Here are some resources <https://about.canva.com/coronavirus-awareness-collection/?utm_medium=pr&utm_source=news&utm_campaign=covid19_templates> that can help.
<https://twitter.com/canva> <https://facebook.com/canva> <https://au.linkedin.com/company/canva> <https://instagram.com/canva>

Re: "Slow" Query performance with boosts.

Posted by Ashwin Ramesh <as...@canva.com.INVALID>.
Hi everyone,

Just wanted to message again to see if anyone had any advice or answers to
the questions!

Thanks again

Re: "Slow" Query performance with boosts.

Posted by Joel Bernstein <jo...@gmail.com>.
One other thing to check is the performance on each node. You can do this
by running the query with the parameter distrib=false on each node. A
distributed search is only as fast as the slowest node. So you'll want to
rule out an underpowered node.


Joel Bernstein
http://joelsolr.blogspot.com/
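A sketch of the per-node check described above (node addresses and the collection name are hypothetical; in practice you would read `QTime` out of each node's JSON response rather than use the hard-coded timings shown here):

```python
from urllib.parse import urlencode

def node_query_url(node, collection, params):
    """Build a single-node, non-distributed query URL."""
    p = dict(params, distrib="false")  # search only this node's local cores
    return f"{node}/solr/{collection}/select?{urlencode(p)}"

def slowest_node(qtimes):
    """Given {node_name: QTime_ms}, return the node bounding distributed latency."""
    return max(qtimes, key=qtimes.get)

# Hypothetical nodes and timings, for illustration only.
print(node_query_url("http://node1:8983", "mycollection",
                     {"q": "hello world", "rows": "10"}))
print(slowest_node({"node1": 410, "node2": 395, "node3": 790}))  # node3
```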



Re: "Slow" Query performance with boosts.

Posted by Ashwin Ramesh <as...@canva.com.INVALID>.
Hi all,

Thanks for the feedback!

@Alessandro - The additive boost is primarily a legacy function that we are
moving away from. It allowed us to rank specific documents higher than the
rest of the potential result set, by ensuring the score of those documents
is an order of magnitude higher than the remaining scores.

@Charlie & @Joel - I've run another query, which has 21M matches:

500 rows - 1800ms
200 rows - 1650ms
100 rows - 1500ms
10 rows - 900ms

You are correct that the rows parameter has an impact on latency; however,
the base case is still high! My assumption is that scoring 21M docs (across
8 shards) is computationally expensive?

Regarding why we ask for 500 results - it is so we can do second-phase
ranking on the top N (N=500) with features that are not in Solr.

My current hypotheses are:
1. Our Solr configuration (hardware and/or solrconfig.xml, solr.xml) is
misconfigured for our use case.
2. Our boosting functions & schema (field types) are misconfigured -
however, after this thread, I'm pretty certain that the field types we use
for the boosts are as optimized as possible.
3. We have to change our scoring function so that a given query does not
match 20+ million documents, probably by adding more AND clauses to cut the
result set down. This is something we are already working on.

For context on the third point, I changed my query to make every term in
the query mandatory:

Total scored: 37,000

500 rows - 30ms
10 rows - 30ms (approx the same)

Obviously this can't be done across the board otherwise recall will drop
too drastically for some query sets.
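The mandatory-terms experiment above can be reproduced either by prefixing each term with `+` or by sending `q.op=AND` with the request; a minimal sketch (the helper function itself is hypothetical, not from the thread):

```python
def make_terms_mandatory(q):
    """Prefix each whitespace-separated term with '+' so every term must match."""
    return " ".join("+" + term for term in q.split())

print(make_terms_mandatory("hello world"))  # +hello +world
# Equivalent request-level switch: leave q unchanged and send q.op=AND instead.
```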

Regards,

Ash

Re: "Slow" Query performance with boosts.

Posted by Alessandro Benedetti <a....@sease.io>.
On top of the already good suggestions to reduce the scope of your
experiment, let's see:

boost:def(boostFieldA,1) // boostFieldA is docValue float type

The first part looks all right to me; it's expensive, though, independently
of the number of rows returned (as the boost request parameter is parsed as
an additional query that affects the score).
Enabling doc-values on such a field is probably the best option you have.

In regards to the second part:
bf=mul(termfreq(termScoreFieldB,$q),1000.0) // termScoreFieldB is a
textField. No docValue, just indexed

This *adds* to the score. Per the function-query documentation, termfreq
"Returns the number of times the term appears in the field for that
document", e.g.:

termfreq(text,'memory')

So I am not even sure how a multi-term query is handled (of course this
also depends on the tokenization of termScoreFieldB).
The *1000* there smells a lot like bad practice: you are adding to your
score, and your score is not probabilistic, nor limited to a constant range
of values (the main Lucene score value depends on the query and the index).
It feels you are likely to get better behaviour modelling such a
requirement as an additional boost query rather than a boost function, but
I am curious to know what it is that you are attempting to do.
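A sketch of the reshaping described above: replacing the additive raw-termfreq function with a boost query, so the contribution goes through normal, analyzed Lucene scoring. The `^10` weight and query text here are illustrative assumptions, not values from the thread:

```python
# Before (from the thread): adds raw term frequency * 1000 to the score,
# an unbounded value unrelated to the main query's score range.
before = {"bf": "mul(termfreq(termScoreFieldB,$q),1000.0)"}

# After (sketch): an additive boost query whose contribution is a normally
# scored Lucene query; the ^10 weight is a hypothetical starting point.
after = {"bq": "termScoreFieldB:(hello world)^10"}

print(before["bf"], "->", after["bq"])
```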

Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr PMC member and Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io



Re: "Slow" Query performance with boosts.

Posted by Joel Bernstein <jo...@gmail.com>.
Testing out a smaller "rows" param is key. Then you can isolate the
performance difference due to the 500 rows. Adding more shards is going to
increase the penalty for having 500 rows, so it's good to understand how
big that penalty is.

Then test out smaller result sets by adjusting the query. Gradually
increase the result set size by adjusting the query. You then can get a
feel for how result set size affects performance. This will give you an
indication how much it will help to have more shards.





Joel Bernstein
http://joelsolr.blogspot.com/



Re: "Slow" Query performance with boosts.

Posted by Charlie Hull <ch...@opensourceconnections.com>.
Hi Ashwin,

What happens if you reduce the number of rows requested? Do you really 
need 500 results each time? I think this will ask for 500 results from 
*each shard* too. 
https://solr.apache.org/guide/8_7/pagination-of-results.html
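The per-shard point is worth quantifying: in the first phase of a distributed query, each shard returns its top `rows` (id, score) pairs, and the coordinating node merge-sorts all of them to pick the global top `rows`. A back-of-the-envelope sketch with the numbers from this thread:

```python
# Numbers from the thread: 8 shards, rows=500.
shards = 8
rows = 500

# Each shard contributes up to `rows` candidates to the merge.
merged_candidates = shards * rows
print(merged_candidates)  # 4000 candidates merged to return 500 documents
```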

Also it looks like you mean boost=def(boostFieldA,1) not 
boost:def(boostFieldA,1), am I right?

Cheers

Charlie

On 19/01/2022 02:43, Ashwin Ramesh wrote:
> Gentle ping! Promise it's my final one! :)
>
> On Thu, Jan 13, 2022 at 8:01 AM Ashwin Ramesh<as...@canva.com>  wrote:
>
>> Hi everyone,
>>
>> I have a few questions about how we can improve our solr query
>> performance, especially for boosts (BF, BQ, boost, etc).
>>
>> *System Specs:*
>> Solr Version: 7.7.x
>> Heap Size: 31gb
>> Num Docs: >100M
>> Shards: 8
>> Replication Factor: 6
>> Index is completely mapped into memory
>>
>>
>> Example query:
>> {
>> q=hello world
>> qf=title description keywords
>> pf=title^0.5
>> ps=0
>> fq=type:P
>> boost:def(boostFieldA,1) // boostFieldA is docValue float type
>> bf=mul(termfreq(termScoreFieldB,$q),1000.0) // termScoreFieldB is a
>> textField. No docValue, just indexed
>> rows:500
>> fl=id,score
>> }
>>
>> numFound: >21M
>> qTime: 800ms
>>
>> Experimentation of params:
>>
>>     - When I remove the boost parameter, the qTime drops to 525ms
>>     - When I remove the bf parameter, the qTime drops to 650ms
>>     - When I remove both the boost & bf parameters, the qTime drops to
>>     400ms
>>
>>
>> Questions:
>>
>>     1. Is there any way to improve the performance of the boosts (specific
>>     field types, etc)?
>>     2. Will sharding further such that each core only has to score a
>>     smaller subset of documents help with query performance?
>>     3. Is there any performance impact when boosting/querying against
>>     sparse fields, both indexed=true or docValues=true?
>>     4. It seems the base case scoring is 400ms, which is already quite
>>     high. Is this because the query (hello world) implicitly gets parsed as
>>     (hello OR world)? Thus it would be more computationally expensive?
>>     5. Any other advice :) ?
>>
>>
>> Thanks in advance,
>>
>> Ash
>>


Re: "Slow" Query performance with boosts.

Posted by Ashwin Ramesh <as...@canva.com.INVALID>.
Gentle ping! Promise it's my final one! :)

