Posted to users@solr.apache.org by Stephen Lewis Bianamara <st...@gmail.com> on 2022/09/02 19:56:39 UTC

Problem with "sow" and WordGraphDelimeter

Hey Solr Users,

I've noticed an odd interaction between the word delimiter graph filter and
the sow parameter. When the word delimiter graph filter is invoked and
sow=true, queries can miss results that require alphanumeric splitting but
aren't exact matches. For example, if I have a document with
"ABC123 DEF456_GHI", the combination of sow=true and WordDelimiterGraph
seems to break queries for "def456". See the full repro below.

I believe this is a bug. Could someone please take a look at my repro and
confirm it, or let me know if something is misconfigured here?

*Repro*

   - Solr 9, with this field type definition for the field "test_en":

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateAll="1" preserveOriginal="1"
            splitOnCaseChange="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateAll="1" preserveOriginal="1"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>

   - Create document {"id": 1, "test_en": ["ABC123 DEF456_GHI"]}
   - Query the following; all should hit, but one combination misses
      - sow=true, q=def456
         - misses
      - sow=true, q=abc123
         - hits
      - sow=false, q=def456
         - hits
      - sow=false, q=abc123
         - hits
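
For anyone who wants to rerun the four combinations above, here is a minimal
sketch that only builds the request URLs. The host, port, and the collection
name "repro" are assumptions and do not come from this thread; "df" and
"debugQuery" are added so that the parsed query can be compared between
sow=true and sow=false.

```python
from itertools import product
from urllib.parse import urlencode

# Assumption (not from the thread): a local Solr with a collection named "repro".
SOLR_SELECT = "http://localhost:8983/solr/repro/select"


def repro_urls(field="test_en", terms=("def456", "abc123")):
    """Build one select URL per (sow, term) combination from the repro list."""
    urls = {}
    for sow, term in product(("true", "false"), terms):
        params = {
            "q": term,
            "df": field,           # default search field: the field from the repro
            "sow": sow,            # split-on-whitespace flag under test
            "debugQuery": "true",  # ask Solr to include the parsed query
        }
        urls[(sow, term)] = f"{SOLR_SELECT}?{urlencode(params)}"
    return urls


if __name__ == "__main__":
    for (sow, term), url in repro_urls().items():
        print(f"sow={sow} q={term}: {url}")
```

Fetching each URL and comparing the parsed query in the debug section for the
sow=true and sow=false variants of q=def456 should show where the two diverge.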

Re: Problem with "sow" and WordGraphDelimeter

Posted by Alessandro Benedetti <a....@sease.io>.

My bad, I was trying to add a colleague of mine to the discussion, but I
possibly did it the wrong way!
We are observing some problems when combining those two token filters; we may
update the mail thread in the next few days!

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Wed, 12 Apr 2023 at 16:02, Alessandro Benedetti <a....@sease.io>
wrote:

> FYI

Re: Problem with "sow" and WordGraphDelimeter

Posted by Alessandro Benedetti <a....@sease.io>.

FYI

On Fri, 9 Sept 2022 at 15:49, Alessandro Benedetti <a....@sease.io>
wrote:

> Not related to the word-delimiter token filter but I did a study a while
> ago on the sow parameter, identified a couple of bugs and fixed one (the
> other was discussed and in the end not accepted as an improvement as it was
> controversial).
>
>
> https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html

Re: Problem with "sow" and WordGraphDelimeter

Posted by Alessandro Benedetti <a....@sease.io>.

Not related to the word-delimiter token filter but I did a study a while
ago on the sow parameter, identified a couple of bugs and fixed one (the
other was discussed and in the end not accepted as an improvement as it was
controversial).

https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html

Cheers

On Wed, 7 Sept 2022 at 14:19, Markus Jelsma <ma...@openindex.io>
wrote:

> Hello Stephen,
>
> Using Solr 8.8.1, I tried to reproduce your strange problem: I copied your
> schema and indexed a single document. As expected, I got exactly one result
> for all four combinations, with both the default Lucene QParser and the
> Edismax QParser.
>
> So it appears to work just fine here on 8.8.1. The WordDelimiterGraph filter
> is relatively new and has had only a few issues. Maybe you can try to see if
> it works without the Graph-type token filters, using the old WordDelimiter
> filter. That one is tried and tested.
>
> Regards,
> Markus

Re: Problem with "sow" and WordGraphDelimeter

Posted by Markus Jelsma <ma...@openindex.io>.

Hello Stephen,

Using Solr 8.8.1, I tried to reproduce your strange problem: I copied your
schema and indexed a single document. As expected, I got exactly one result
for all four combinations, with both the default Lucene QParser and the
Edismax QParser.

So it appears to work just fine here on 8.8.1. The WordDelimiterGraph filter
is relatively new and has had only a few issues. Maybe you can try to see if
it works without the Graph-type token filters, using the old WordDelimiter
filter. That one is tried and tested.

Regards,
Markus
