
Posted to users@solr.apache.org by Drini Cami <cd...@gmail.com> on 2021/05/06 18:22:13 UTC

text_en_splitting with quotes not matching when there are 2 adjacent stopwords

Hello! I have a question about the text_en_splitting fieldType (Solr 8.8.2,
very vanilla schema). I noticed that it fails to match quoted queries like
`title:"The Mark of the Crown"`, but succeeds for unquoted queries like
`title:The Mark of the Crown`. Using the Solr analysis tool, I noticed that
the index analyzer converts "The Mark of the Crown" to `[_, mark, _, crown]`,
but the query analyzer converts it to `[_, mark, _, _, crown]`. I then noticed
that the index analyzer has FlattenGraphFilterFactory as its final filter,
which seems to combine adjacent `_`. I tried also adding
FlattenGraphFilterFactory to the query analyzer, and that fixed the issue. Is
this a reasonable solution? If so, should that be the default? Or am I using
the wrong fieldType altogether?
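
To make the mismatch concrete, here is a minimal Python sketch of positional
phrase matching. It is a toy model, not Lucene's implementation: the helper
names `positions` and `phrase_matches` are made up for illustration, each term
is assumed to occur only once, and `query_after` is an assumption about what
the query analyzer emits once FlattenGraphFilterFactory is added to it.

```python
# Toy model of positional phrase matching; not Lucene's actual implementation.

def positions(tokens):
    """Map each real term to its 1-based position; "_" marks a removed stopword."""
    return {term: i for i, term in enumerate(tokens, start=1) if term != "_"}

def phrase_matches(indexed, query):
    """True if the query terms occur in the index at the same relative offsets."""
    idx, qry = positions(indexed), positions(query)
    if not qry or any(term not in idx for term in qry):
        return False
    # Every query term must be shifted by the same amount relative to the index.
    shifts = {idx[term] - qry[term] for term in qry}
    return len(shifts) == 1

indexed = ["_", "mark", "_", "crown"]            # index analyzer output from above
query_before = ["_", "mark", "_", "_", "crown"]  # query analyzer output from above
query_after = ["_", "mark", "_", "crown"]        # assumed output after the change

print(phrase_matches(indexed, query_before))  # False
print(phrase_matches(indexed, query_after))   # True
```

In this model the unquoted query still matches because each term is looked up
on its own, so the extra gap between `mark` and `crown` never comes into play.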

Thank you,

Drini

Re: text_en_splitting with quotes not matching when there are 2 adjacent stopwords

Posted by Alessandro Benedetti <a....@sease.io>.

Hi Drini,
I would recommend investigating the code a bit. That token filter is meant
to flatten multiple terms at the same position, to keep things super simple,
so it seems suspicious that it would merge two adjacent tokens and generate
incorrect positions.
Have you checked the position and positionLength attributes of the generated
tokens?
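
For anyone checking those attributes, here is a minimal Python sketch of how
absolute positions follow from them. It mirrors Lucene's convention that the
position starts at -1 and advances by each token's positionIncrement, but the
function name `absolute_positions` and the attribute values are hypothetical,
chosen to reproduce the `[_, mark, _, _, crown]` output described in the
question.

```python
# Toy bookkeeping: positions start at -1 and each token advances the
# position by its positionIncrement (0-based positions).

def absolute_positions(tokens):
    """tokens: list of (term, position_increment, position_length)."""
    pos = -1
    out = []
    for term, inc, length in tokens:
        pos += inc
        out.append((term, pos, pos + length))  # (term, start position, end position)
    return out

# Hypothetical values: each removed stopword appears as an extra increment
# on the next token that is kept.
query_tokens = [("mark", 2, 1), ("crown", 3, 1)]

for row in absolute_positions(query_tokens):
    print(row)
```

Comparing these triples between the index and query chains shows exactly
where the two diverge.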

Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io

