
Posted to dev@solr.apache.org by Rudi Seitz <ru...@gmail.com> on 2023/03/17 17:32:45 UTC

Re: seeking feedback on edismax term-centric/field-centric proposal to resolve mm issue

I've made a draft PR for the issue I wrote about back in January -- edismax's
unpredictable "flip" between field-centric and term-centric query
structures.

https://github.com/apache/solr/pull/1463

If anyone's interested in this issue but needed to see some code, now
there's a draft to look at.

In a nutshell, here's what the PR does:

1) During query analysis, when field analyzers generate Tokens that get
converted into Terms and eventually into TermQueries, we now store the
startOffset from the Token on the generated TermQuery.

2) When edismax attempts to restructure a field-centric query as a
term-centric one, it now regroups the query clauses by startOffset rather
than relying on the previous clause-position heuristic.

This means that edismax can stay with a term-centric query structure even
when the different field analyzers output differing numbers of tokens.
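
To make the regrouping idea concrete, here is a minimal, self-contained
sketch. It is not the code from the PR: the Clause record and the
groupByStartOffset method are hypothetical stand-ins for a TermQuery that
carries its Token's startOffset, and the stopword behavior of the two
fields is assumed purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StartOffsetGrouping {

    /** Hypothetical stand-in for a per-field term clause that remembers its token's startOffset. */
    record Clause(String field, String text, int startOffset) {}

    /**
     * Regroups clauses from all fields by startOffset. Each resulting group
     * holds the per-field variants of one input term (one term-centric unit).
     * Groups are returned in ascending startOffset order.
     */
    static List<List<Clause>> groupByStartOffset(List<Clause> fieldCentric) {
        Map<Integer, List<Clause>> groups = new TreeMap<>();
        for (Clause c : fieldCentric) {
            groups.computeIfAbsent(c.startOffset(), k -> new ArrayList<>()).add(c);
        }
        return new ArrayList<>(groups.values());
    }

    public static void main(String[] args) {
        // Query text "the quick fox": the=0, quick=4, fox=10.
        // Assumption: "title" removes the stopword "the", "body" keeps it,
        // so the two fields produce 2 and 3 tokens respectively.
        List<Clause> fieldCentric = List.of(
            new Clause("title", "quick", 4),
            new Clause("title", "fox", 10),
            new Clause("body", "the", 0),
            new Clause("body", "quick", 4),
            new Clause("body", "fox", 10));

        // Print one line per term-centric group.
        for (List<Clause> group : groupByStartOffset(fieldCentric)) {
            StringBuilder sb = new StringBuilder();
            for (Clause c : group) {
                if (sb.length() > 0) sb.append(" | ");
                sb.append(c.field()).append(':').append(c.text());
            }
            System.out.println("(" + sb + ")");
        }
    }
}
```

In this sketch the five clauses fall into three groups, one per input term,
even though the two fields produced different numbers of tokens.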

I had thought the proposed change would require updates in both the lucene
and solr repos, but I found a way to get the draft PR working in a
self-contained way, with only changes inside the solr repo. This did
involve copying the QueryBuilder class from lucene into solr. A final
version of this change would probably want to avoid that duplication and
make the QueryBuilder changes directly in the lucene repo, but I hope the
current approach makes things easier to review & test at this draft stage.

Feedback invited.

Rudi

On Tue, Jan 17, 2023 at 2:45 PM Rudi Seitz <ru...@gmail.com> wrote:

> Hi everyone,
>
> I've been looking into a known issue where edismax sometimes switches from
> a term-centric to a field-centric query generation style. This happens when
> sow=false and the per-field analyzers generate differing numbers of tokens.
> It's a problem worth solving because it causes inconsistency with the
> semantics of the mm parameter.
>
> I wrote a proposal for fixing this in SOLR-16594
> <https://issues.apache.org/jira/browse/SOLR-16594> and am gently nudging
> to see if anyone has feedback on the proposal. Do you think this approach
> might work, or could you help me by explaining why it wouldn't work? It'd
> be great to hear from anyone who's interested in this topic, on the ticket
> directly or via this email thread. Thanks in advance!
>
> Rudi
>
> PS. There's more detail in the ticket, including links to other tickets &
> blog entries, but here's a summary:
>
> 1) The challenge in generating a term-centric query when sow=false is that
> the tokens that come out of an analysis chain don't have explicit pointers
> to the input terms that they should be grouped by.
> 2) When the field analyzers all generate the same number of tokens,
> edismax rewrites an initial set of field-centric clauses as term-centric
> ones, using clause-position as a grouping heuristic, but this doesn't work
> if there are differing numbers of tokens.
> 3) The current proposal is to use the startOffset of a token as the basis
> for doing term-centric grouping.
> 4) There's an implementation challenge here because startOffset is not
> propagated to the Term objects that edismax works with, but it could be.
>
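
Point 2 of the summary above can be sketched as follows. This is a
simplified model of a clause-position heuristic, not edismax's actual
code: the ClausePositionHeuristic class, the groupByPosition method, and
the use of an empty Optional to represent "stay field-centric" are all
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ClausePositionHeuristic {

    /**
     * Zips per-field token lists by clause position. Returns an empty Optional
     * when the lists differ in length, modelling the case where position-based
     * regrouping is not possible and the query stays field-centric.
     */
    static Optional<List<List<String>>> groupByPosition(Map<String, List<String>> tokensByField) {
        int expected = -1;
        for (List<String> tokens : tokensByField.values()) {
            if (expected == -1) {
                expected = tokens.size();
            } else if (tokens.size() != expected) {
                return Optional.empty(); // differing token counts: positions do not line up
            }
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < Math.max(expected, 0); i++) {
            List<String> group = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : tokensByField.entrySet()) {
                group.add(e.getKey() + ":" + e.getValue().get(i));
            }
            groups.add(group);
        }
        return Optional.of(groups);
    }

    public static void main(String[] args) {
        // Same token count in both fields: position i in one field pairs with position i in the other.
        Map<String, List<String>> same = new LinkedHashMap<>();
        same.put("title", List.of("quick", "fox"));
        same.put("body", List.of("quick", "fox"));
        System.out.println(groupByPosition(same).isPresent());

        // Assumed stopword removal on "title" only: 2 tokens vs 3, so no regrouping.
        Map<String, List<String>> differ = new LinkedHashMap<>();
        differ.put("title", List.of("quick", "fox"));
        differ.put("body", List.of("the", "quick", "fox"));
        System.out.println(groupByPosition(differ).isPresent());
    }
}
```

The second case is the one the proposal targets: position alone cannot say
which tokens belong together, whereas startOffset still can.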

Re: seeking feedback on edismax term-centric/field-centric proposal to resolve mm issue

Posted by Alessandro Benedetti <a....@sease.io>.

Adding Daniele to the loop as he's experiencing a similar problem for a
customer.
I want to take a look at this, but I'm quite busy at the moment; I hope to
find some time in the next two weeks.
Thanks Rudi for working on this!

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io


