Posted to users@solr.apache.org by Sjoerd Smeets <ss...@gmail.com> on 2024/03/25 07:56:19 UTC

Slow performance for phrases with terms with high ttf

Hi,

We are experiencing quite a performance decrease when searching for phrases
that have terms with a high ttf value.

E.g. searching for "note of sale" is around 3 times slower (~10 sec) than
searching for "bill of sale" (~3 sec). This behaviour is consistent and can
also be reproduced when we use other terms that have a high ttf. We are
querying the unstemmed index.

Terms (numDocs: 26220184):

   - "bill" -> df = 1.879.324, ttf = 14.145.950
   - "note" -> df = 8.479.826, ttf = 151.249.542
   - "sale" -> df = 7.557.685, ttf = 120.948.163
   - "bill" -> df = 21.244.060, ttf = 6.879.196.700


Is this the expected behaviour or is there something that can be
tuned, like a cache setting?

Thanks,
Sjoerd

Re: Slow performance for phrases with terms with high ttf

Posted by Sjoerd Smeets <ss...@gmail.com>.

Thanks Doug, we'll experiment and let you know how it went.

On Mon, Mar 25, 2024 at 3:59 PM Doug Turnbull
<do...@reddit.com.invalid> wrote:

> It could help, yeah, by parallelizing the work. The tradeoff is that you'll
> only be as fast as your slowest shard (i.e. tail latency). With more shards
> there's a higher chance that one of them is having a bad hair day, doing a
> GC, or something, which slows the whole request down and probably increases
> the variance in the overall response time. So definitely look at p90+
> changes, not just p50.

Re: Slow performance for phrases with terms with high ttf

Posted by Doug Turnbull <do...@reddit.com.INVALID>.

It could help, yeah, by parallelizing the work. The tradeoff is that you'll
only be as fast as your slowest shard (i.e. tail latency). With more shards
there's a higher chance that one of them is having a bad hair day, doing a
GC, or something, which slows the whole request down and probably increases
the variance in the overall response time. So definitely look at p90+
changes, not just p50.
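
To put a rough number on the tail-latency point, here is a small sketch.
It assumes each shard is independently "slow" on a given request with the
same probability; the 1% figure is a made-up illustration, not a measurement
from this cluster.

```python
def p_any_slow(shards: int, p_slow: float) -> float:
    """Probability that at least one of `shards` shards is slow on a request.

    Assumes shards misbehave independently with probability `p_slow` each.
    Since a distributed query waits for every shard, this is also the
    probability that the whole request is slow.
    """
    return 1.0 - (1.0 - p_slow) ** shards


if __name__ == "__main__":
    # Hypothetical 1% chance per shard of a GC pause or similar hiccup.
    for n in (4, 8, 16):
        print(f"{n:2d} shards -> {p_any_slow(n, 0.01):.4f}")
```

Under these assumptions the chance of hitting a slow shard grows with the
shard count, which is why the upper percentiles matter more than the median
when comparing shard layouts.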

On Mon, Mar 25, 2024 at 10:51 AM Sjoerd Smeets <ss...@gmail.com> wrote:

> Thanks Doug!
>
> Do you think adding more shards would help in this case? Putting the index
> in memory is not really possible as the index is up to 2.5 TB. We have SSDs
> though, so that is the closest we can get. We have 16 CPUs and configured
> it for 4 shards. Would splitting it up into more shards potentially help?
> We'll run some experiments anyway.

Re: Slow performance for phrases with terms with high ttf

Posted by Sjoerd Smeets <ss...@gmail.com>.

Thanks Doug!

Do you think adding more shards would help in this case? Putting the index
in memory is not really possible as the index is up to 2.5 TB. We have SSDs
though, so that is the closest we can get. We have 16 CPUs and configured
it for 4 shards. Would splitting it up into more shards potentially help?
We'll run some experiments anyway.

On Mon, Mar 25, 2024 at 3:19 PM Doug Turnbull
<do...@reddit.com.invalid> wrote:

> As someone currently implementing a lot of positional search from scratch
> (in a different side-project), I can say it's totally expected behavior
> that high TTF / DF terms would be harder. To match the phrase there are
> simply more candidate documents and positions to intersect, so it's
> naturally a tougher problem.
>
> If you think about how phrase search works, you can roughly picture it as:
> 1. Find all documents that contain every term
> 2. Iterate the positions in these documents so that "bill" is exactly one
> before "of", which is exactly one before "sale"... etc
>
> I'd say the best you could do is:
>
> 1. Make sure your index can fit in memory.
> 2. Ensure you add any filters (fq) if you have any mandatory requirements.
> Add a filter cache. Don't cache anything that's query-dependent.
> 3. If it's a really common phrase, think about tokenizing it into a single
> term, "bill of sale" -> "bill_of_sale", which you could do outside the search
> engine or with text analysis. The downside is that you lose the ability to
> match the individual terms. You could of course create a different field
> for these significant phrases if that's important.
>
> Best
> -Doug

Re: Slow performance for phrases with terms with high ttf

Posted by Chris Hostetter <ho...@fucit.org>.

This is also the sort of thing CommonGramsFilter was designed for...

https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html#common-grams-filter
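
As a rough illustration of the idea behind common grams (a toy model, not
Lucene's implementation, and ignoring token positions): whenever a token or
its neighbour is on a list of very common words, an extra bigram token such as
"note_of" is indexed alongside the single terms. A phrase query can then
match on the much rarer bigrams instead of walking every position of "of".
The function name and the word list below are assumptions for illustration.

```python
from typing import Iterable, List, Set


def common_grams(tokens: Iterable[str], common_words: Set[str]) -> List[str]:
    """Emit each token, plus a 'left_right' bigram whenever either side of an
    adjacent pair is a common word. Simplified index-time view only."""
    toks = list(tokens)
    out: List[str] = []
    for i, tok in enumerate(toks):
        out.append(tok)
        if i + 1 < len(toks):
            nxt = toks[i + 1]
            if tok in common_words or nxt in common_words:
                out.append(f"{tok}_{nxt}")
    return out


if __name__ == "__main__":
    COMMON = {"of"}  # hypothetical common-word list
    print(common_grams(["note", "of", "sale"], COMMON))
```

With this word list the phrase yields the extra tokens "note_of" and
"of_sale", both of which should occur far less often than "of" on its own.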


: Date: Mon, 25 Mar 2024 10:17:48 -0400
: From: Doug Turnbull <do...@reddit.com.invalid>
: Reply-To: users@solr.apache.org
: To: users@solr.apache.org
: Subject: Re: Slow performance for phrases with terms with high ttf
: 
: As someone currently implementing a lot of positional search from scratch
: (in a different side-project), I can say it's totally expected behavior
: that high TTF / DF terms would be harder. To match the phrase there are
: simply more candidate documents and positions to intersect, so it's
: naturally a tougher problem.
: 
: If you think about how phrase search works, you can roughly picture it as:
: 1. Find all documents that contain every term
: 2. Iterate the positions in these documents so that "bill" is exactly one
: before "of", which is exactly one before "sale"... etc
: 
: I'd say the best you could do is:
: 
: 1. Make sure your index can fit in memory.
: 2. Ensure you add any filters (fq) if you have any mandatory requirements.
: Add a filter cache. Don't cache anything that's query-dependent.
: 3. If it's a really common phrase, think about tokenizing it into a single
: term, "bill of sale" -> "bill_of_sale", which you could do outside the search
: engine or with text analysis. The downside is that you lose the ability to
: match the individual terms. You could of course create a different field
: for these significant phrases if that's important.
: 
: Best
: -Doug

-Hoss
http://www.lucidworks.com/

Re: Slow performance for phrases with terms with high ttf

Posted by Doug Turnbull <do...@reddit.com.INVALID>.

As someone currently implementing a lot of positional search from scratch
(in a different side-project), I can say it's totally expected behavior
that high TTF / DF terms would be harder. To match the phrase there are
simply more candidate documents and positions to intersect, so it's
naturally a tougher problem.

If you think about how phrase search works, you can roughly picture it as:
1. Find all documents that contain every term
2. Iterate the positions in these documents so that "bill" is exactly one
before "of", which is exactly one before "sale"... etc
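
The two steps above can be sketched in Python on a toy positional index.
This is only an illustration under assumed data structures (a dict of
term -> doc id -> sorted positions); it is not how Lucene actually stores or
intersects postings. It does show where the cost comes from: step 2 touches
every position of the first term in every candidate document.

```python
from typing import Dict, List

# Toy positional index: term -> {doc_id: sorted list of positions}.
# The layout is an assumption for illustration only.
PositionalIndex = Dict[str, Dict[int, List[int]]]


def phrase_match(index: PositionalIndex, phrase: List[str]) -> Dict[int, int]:
    """Return {doc_id: number of exact occurrences of the phrase}."""
    postings = [index.get(term, {}) for term in phrase]
    if not postings:
        return {}

    # Step 1: documents that contain every term, driven by the rarest term.
    rarest = min(postings, key=len)
    candidates = [d for d in rarest if all(d in p for p in postings)]

    # Step 2: check that the terms sit at consecutive positions.
    hits: Dict[int, int] = {}
    for doc in candidates:
        later = [set(p[doc]) for p in postings[1:]]
        count = sum(
            1
            for start in postings[0][doc]
            if all(start + i + 1 in later[i] for i in range(len(later)))
        )
        if count:
            hits[doc] = count
    return hits


if __name__ == "__main__":
    toy_index = {
        "bill": {1: [0], 3: [4]},
        "of": {1: [1, 7], 2: [2], 3: [9]},
        "sale": {1: [2], 2: [3], 3: [10]},
    }
    print(phrase_match(toy_index, ["bill", "of", "sale"]))
```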

I'd say the best you could do is:

1. Make sure your index can fit in memory.
2. Ensure you add any filters (fq) if you have any mandatory requirements.
Add a filter cache. Don't cache anything that's query-dependent.
3. If it's a really common phrase, think about tokenizing it into a single
term, "bill of sale" -> "bill_of_sale", which you could do outside the search
engine or with text analysis. The downside is that you lose the ability to
match the individual terms. You could of course create a different field
for these significant phrases if that's important.
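
Option 3, done outside the search engine, might look roughly like the
sketch below. The function name and the phrase list are hypothetical; the
same rewrite would have to be applied both to documents before indexing and
to incoming queries, ideally into a separate field so the individual terms
stay searchable elsewhere.

```python
from typing import List, Sequence, Set, Tuple


def merge_phrases(tokens: Sequence[str], phrases: Set[Tuple[str, ...]]) -> List[str]:
    """Replace each known phrase with a single underscore-joined token."""
    max_len = max((len(p) for p in phrases), default=0)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so longer phrases win over shorter ones.
        for n in range(max_len, 1, -1):
            window = tuple(tokens[i:i + n])
            if len(window) == n and window in phrases:
                out.append("_".join(window))
                i += n
                break
        else:
            # No phrase starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    # Hypothetical list of significant phrases.
    COMMON_PHRASES = {("bill", "of", "sale"), ("note", "of", "sale")}
    print(merge_phrases("signed a bill of sale today".split(), COMMON_PHRASES))
```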

Best
-Doug

On Mon, Mar 25, 2024 at 6:40 AM Sjoerd Smeets <ss...@gmail.com> wrote:

> There is a typo in my email. The term list should be like this:
>
>
>    - "bill" -> df = 1.879.324, ttf = 14.145.950
>    - "note" -> df = 8.479.826, ttf = 151.249.542
>    - "sale" -> df = 7.557.685, ttf = 120.948.163
>    - "of" -> df = 21.244.060, ttf = 6.879.196.700
>

Re: Slow performance for phrases with terms with high ttf

Posted by Sjoerd Smeets <ss...@gmail.com>.

There is a typo in my email. The term list should be like this:


   - "bill" -> df = 1.879.324, ttf = 14.145.950
   - "note" -> df = 8.479.826, ttf = 151.249.542
   - "sale" -> df = 7.557.685, ttf = 120.948.163
   - "of" -> df = 21.244.060, ttf = 6.879.196.700
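
For a sense of scale, the numbers in the list can be turned into average
positions per matching document (ttf / df) and a ttf ratio between "note" and
"bill". This is plain arithmetic on the figures above; only the helper names
are made up. "sale" is left out because it is common to both phrases.

```python
# (df, ttf) per term, copied from the corrected list above.
STATS = {
    "bill": (1_879_324, 14_145_950),
    "note": (8_479_826, 151_249_542),
    "of": (21_244_060, 6_879_196_700),
}


def avg_positions_per_doc(term: str) -> float:
    """Average number of occurrences per document that contains the term."""
    df, ttf = STATS[term]
    return ttf / df


def ttf_ratio(a: str, b: str) -> float:
    """How many times more total occurrences term `a` has than term `b`."""
    return STATS[a][1] / STATS[b][1]


if __name__ == "__main__":
    for term in STATS:
        print(f"{term}: {avg_positions_per_doc(term):.1f} positions per matching doc")
    print(f"note/bill ttf ratio: {ttf_ratio('note', 'bill'):.1f}")
```

"note" has roughly ten times as many total occurrences as "bill", so a
phrase that starts with it has many more positions to check.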


On Mon, Mar 25, 2024 at 8:56 AM Sjoerd Smeets <ss...@gmail.com> wrote:

> Hi,
>
> We are experiencing quite a performance decrease when searching for
> phrases that have terms with a high ttf value.
>
> E.g. searching for "note of sale" is around 3 times slower (~10 sec) than
> searching for "bill of sale" (~3 sec). This behaviour is consistent and can
> also be reproduced when we use other terms that have a high ttf. We are
> querying the unstemmed index.
>
> Terms (numDocs: 26220184):
>
>    - "bill" -> df = 1.879.324, ttf = 14.145.950
>    - "note" -> df = 8.479.826, ttf = 151.249.542
>    - "sale" -> df = 7.557.685, ttf = 120.948.163
>    - "bill" -> df = 21.244.060, ttf = 6.879.196.700
>
>
> Is this the expected behaviour or is there something that can be
> tuned, like a cache setting?
>
> Thanks,
> Sjoerd
>