Posted to users@solr.apache.org by Francis Crimmins <fc...@nla.gov.au> on 2021/06/22 00:50:21 UTC

Excessive query expansion when using WordDelimiterGraphFilter

Hi:

We are currently upgrading our Solr instance from 5.1.0 to 8.7.0, with an index of over 250 million documents. This powers the search at:

    https://trove.nla.gov.au/

We are using the WordDelimiterGraphFilter in the filter chain for queries:

    https://solr.apache.org/guide/8_7/filter-descriptions.html#word-delimiter-graph-filter

and have found that for some queries it generates a very large number of clauses, which causes excessive CPU load on our cluster. In some cases we have had to restart the cluster.
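
For context, our query-time analyzer is broadly along the lines of the following sketch (the field type name and attribute values are illustrative only, not our exact configuration):

    <fieldType name="fulltext_text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Splits on case changes and intra-word punctuation and also catenates
             the parts, so a single input token can produce several tokens at
             overlapping positions (a token graph). -->
        <filter class="solr.WordDelimiterGraphFilterFactory"
                generateWordParts="1"
                catenateWords="1"
                preserveOriginal="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>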

For example, when we look at the “parsedquery” for the original user query:

(McGillan OR McGillon OR McGillion OR McGillian OR McGillin OR M'Gillin OR M'Gillan OR M'Gillon)

We see it contains:

(+((fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin mgillan mgillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin mgillan m gillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin m gillan mgillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin m gillan m gillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin m gillin mgillan mgillon"~5 ….

The above is an excerpt; the actual query contains 512 of these clauses for this field. The same expansion happens for other fields, so the overall query can end up with an extremely large number of clauses.

Stepping through with a debugger, we can see that QueryBuilder.analyzeGraphPhrase() is the method generating all of these permutations. It has the comment:

“Creates a boolean query from the graph token stream by extracting all the finite strings from the graph and using them to create phrase queries with the appropriate slop.”

If we switch back to the WordDelimiterFilter (which is now deprecated), the parsed query only gets the following added in:

fulltext:"(mc mcgillan) gillan (mc mcgillon) gillon (mc mcgillion) gillion (mc mcgillian) gillian (mc mcgillin) gillin (m mgillin) gillin (m mgillan) gillan (m mgillon) gillon"~5)

This does not generate the load seen in the other configuration.

As far as we can tell, the WordDelimiterGraphFilter and the way its output gets parsed are working as expected/documented. We have come across the following issue:

    https://issues.apache.org/jira/browse/SOLR-13336

but we’re not sure that setting:

    https://solr.apache.org/guide/8_7/query-settings-in-solrconfig.html#maxbooleanclauses

to a much lower number is the best solution here.
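
For reference, that limit lives in solrconfig.xml and looks roughly like this (the 1024 shown is Solr’s shipped default, not a value we are proposing):

    <config>
      <query>
        <!-- Upper bound on the number of clauses in a generated BooleanQuery;
             queries that expand past this limit are rejected. -->
        <maxBooleanClauses>1024</maxBooleanClauses>
      </query>
    </config>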

We would like to find a way to avoid the excessive query expansion we are seeing. Has anyone else encountered this problem?

Any advice or suggestions gratefully received.

Thanks,

Francis.

--
Francis Crimmins | Senior Software Engineer | National Library of Australia
M: +61 0433 545 884 | E: fcrimmins@nla.gov.au | nla.gov.au

The National Library of Australia (NLA) acknowledges Australia’s First Nations Peoples – the First Australians – as the Traditional Owners and Custodians of this land and gives respect to the Elders – past and present – and through them to all Australian Aboriginal and Torres Strait Islander people.


Re: Excessive query expansion when using WordDelimiterGraphFilter

Posted by Francis Crimmins <fc...@nla.gov.au>.
Hi Michael:

Apologies for the delay in replying; I’ve been on leave since the 24th and am just back at work today.

We have tried setting “enableGraphQueries” to false as you suggested, and switched back to the WordDelimiterGraphFilter. We no longer see the excessive query expansion. Since this is a supported configuration setting and it means we don’t need to fall back to a now-deprecated class, this sounds like the best solution for us.
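
For anyone who finds this thread later, the change amounts to a single attribute on the field type definition in the schema, roughly like this (the field type name and analyzer details are illustrative, not our exact configuration):

    <fieldType name="fulltext_text" class="solr.TextField"
               positionIncrementGap="100" enableGraphQueries="false">
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterGraphFilterFactory"
                generateWordParts="1" catenateWords="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>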

Thanks once again for your detailed suggestions, we appreciate it!

Cheers,

Francis.

--
Francis Crimmins | Senior Software Engineer | National Library of Australia
M: +61 0433 545 884 | E: fcrimmins@nla.gov.au | nla.gov.au

The National Library of Australia (NLA) acknowledges Australia’s First Nations Peoples – the First Australians – as the Traditional Owners and Custodians of this land and gives respect to the Elders – past and present – and through them to all Australian Aboriginal and Torres Strait Islander people.


Re: Excessive query expansion when using WordDelimiterGraphFilter

Posted by Michael Gibney <mi...@michaelgibney.net>.
Right, I forgot to address the question of the legacy
WordDelimiterFilter. Still considering only query-time analysis, I'm
90% confident that you'd be trading WDGF and its resulting
"exponential-but-correct" query expansion for the
"linear-but-incorrect" queries that result from legacy WDF. This may
actually be a viable solution for you, but it would effectively break
the way these graph queries are supposed to work.

Have you tried setting the fieldType `enableGraphQueries` property [1]
to false? The documentation is a little opaque, but looking at the
code in QueryBuilder it looks like that property forces the use of
MultiPhraseQuery, which I don't think suffers the same exponential
expansion issue. If this works the way I think it should, it would
indeed be the kind of "simple toggle switch" that you're hoping for,
and would probably also be the "least incorrect" way to avoid
exponential expansion altogether. (I'm not certain, but I think this
might actually result in building the same query you were getting when
swapping WDGF for WDF) ... and notably, your "phrase slop" of 5 should
give enough headroom for most "normal queries" to avoid false
negatives as a result of the "not-quite-correct" MultiPhraseQuery ...

I still think you might get better functionality by indexing both ways
(1 split, 1 catenate), as I initially suggested; but there are
tradeoffs to consider (not least index size!), and even that approach
would not be completely "correct", I think ...

Michael

[1] https://solr.apache.org/guide/8_9/field-type-definitions-and-properties.html#general-properties


Re: Excessive query expansion when using WordDelimiterGraphFilter

Posted by Francis Crimmins <fc...@nla.gov.au>.
Hi Michael:

Thanks for your detailed response, which is much appreciated. We’ll take a look at some of your suggestions.

We know that if we drop back to using the old WordDelimiterFilter we do not see this behaviour, but we would prefer not to use a deprecated class if possible.

It would be great if there were some parameter or configuration option that allowed a limit or threshold to be set to avoid this behaviour. We’re concerned that specifying a low “maxBooleanClauses” setting may adversely affect other parts of query processing (although we’re not sure whether that would be the case).

Cheers,

Francis.

--
Francis Crimmins | Senior Software Engineer | National Library of Australia
M: +61 0433 545 884 | E: fcrimmins@nla.gov.au | nla.gov.au

The National Library of Australia (NLA) acknowledges Australia’s First Nations Peoples – the First Australians – as the Traditional Owners and Custodians of this land and gives respect to the Elders – past and present – and through them to all Australian Aboriginal and Torres Strait Islander people.


Re: Excessive query expansion when using WordDelimiterGraphFilter

Posted by Michael Gibney <mi...@michaelgibney.net>.
Hi Francis,

I have indeed encountered this problem -- though as you've discovered
it's dependent on specific types of analysis chain config.

You should be able to avoid this behavior by constructing your
analysis chains such that they don't in practice create graph
TokenStreams. This advice generally applies more strongly at
index-time, because the graph structure of the TokenStream is
discarded at index-time. But in a case like yours (a relatively large
index, and queries that have proven to be problematic) I think your
intuition is correct that you could both:
1. lower the maxBooleanClauses limit as a failsafe, and also
2. mitigate negative functional consequences by modifying your
analysis chains to reduce or eliminate this exponential expansion in
the first place

My recommendation would be to index into two separate fulltext fields:
one with wdgf configured to only split (i.e.,
"generate*Parts"/"splitOn", etc.), and one with wdgf configured to only
concatenate (i.e., "preserveOriginal", "catenate*"). This should
prevent graph token streams from being created, and should thus
prevent the kind of exponential expansion you're seeing here.
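
As a very rough sketch (field type names and flag values below are
purely illustrative; adjust to your own chain), the two analyzers
might look something like:

    <!-- split only: emit word parts, no catenation -->
    <fieldType name="fulltext_split" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterGraphFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                preserveOriginal="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- catenate only: emit joined terms plus the original, no splitting -->
    <fieldType name="fulltext_catenate" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterGraphFilterFactory"
                generateWordParts="0" generateNumberParts="0"
                catenateWords="1" catenateNumbers="1" catenateAll="0"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>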

If you really want to have your cake and eat it too (to the extent
that's possible at the moment), you could use two fields configured as
mentioned above for _phrase_ searching (i.e.,
`pf=fulltext_split,fulltext_catenate`), and have a third field with
wdgf configured to split _and_ catenate (which I infer is the current
configuration?) and use that as the query field (i.e.,
`qf=fulltext_split_and_catenate`). The problem is really the implicit
phrase queries (`pf`) in this case; so in fact you might get passable
results by simply disabling `pf` altogether (setting it to empty) --
though that would be a very blunt instrument, so I wouldn't recommend
disabling `pf` unless absolutely necessary.
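
In edismax terms that arrangement might look something like the
following (the field names are the hypothetical ones above, and the ps
value just mirrors the slop of 5 you're already using):

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <!-- plain term matching against the split-and-catenate field -->
        <str name="qf">fulltext_split_and_catenate</str>
        <!-- implicit phrase queries only against the two non-graph fields -->
        <str name="pf">fulltext_split fulltext_catenate</str>
        <str name="ps">5</str>
      </lst>
    </requestHandler>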

The bigger picture here is also interesting (to me, at least!): there
were some initial steps towards more general application of true
"positional graph queries" (e.g., LUCENE-7638 [1]). But due to
problems fundamental to positional graph queries (well described in
LUCENE-7398 [2], though not unique to span queries), much of this
functionality has been effectively removed from the most common query
parsers (e.g., LUCENE-8477 [3], LUCENE-9207 [4]). The timing of your
question resonates with me, as I've recently been working to add
benchmarks that illustrate the performance impact of this "exponential
expansion" behavior (see the comments tacked onto the end of
LUCENE-9204 [5]).

Michael

[1] https://issues.apache.org/jira/browse/LUCENE-7638
[2] https://issues.apache.org/jira/browse/LUCENE-7398
[3] https://issues.apache.org/jira/browse/LUCENE-8477
[4] https://issues.apache.org/jira/browse/LUCENE-9207
[5] https://issues.apache.org/jira/browse/LUCENE-9204

ps- Some further relevant issues/blog posts:
https://issues.apache.org/jira/browse/LUCENE-4312

https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/
https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
https://michaelgibney.net/lucene/graph/

