Posted to solr-user@lucene.apache.org by Jeroen Steggink | knowsy <je...@knowsy.nl> on 2020/02/19 15:27:22 UTC

Phrase search and WordDelimiterGraphFilter not working as expected with mixed delimited and non-delimited tokens

Hi,

I have a question regarding phrase search in combination with a 
WordDelimiterGraphFilter (Solr 8.4.1).

Whenever I search for a phrase that combines delimited and 
non-delimited tokens, I don't get any matches.

This is the configuration:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
       <analyzer type="index">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.ASCIIFoldingFilterFactory"/>
         <filter class="solr.WordDelimiterGraphFilterFactory"
                 generateWordParts="1"
                 generateNumberParts="1"
                 catenateWords="1"
                 catenateNumbers="0"
                 catenateAll="0"
                 splitOnCaseChange="1"
                 preserveOriginal="1"/>
         <filter class="solr.FlattenGraphFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.ASCIIFoldingFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
</fieldType>

<field name="text" type="text" indexed="true" stored="true" omitTermFreqAndPositions="false" />


Example document:

{
   id: '1',
   text: 'mr. i.n.i.t. firstsirname secondsirname'
}

Queries and results:

Query:
"mr. i.n.i.t. firstsirname"
-----
No result

Query:
"mr. i.n.i.t."
-----
Result

Query:
"mr. i n i t"
-----
Result

Query:
"mr. init"
-----
Result

Query:
"mr init"
-----
Result

Query:
"i.n.i.t. firstsirname"
-----
No result

Query:
"init firstsirname"
-----
No result

Query:
"i.n.i.t. firstsirname secondsirname"
-----
No result

Query:
"init firstsirname secondsirname"
-----
No result


I don't quite understand why this happens. Looking at the analyzer 
output, I can't see why a phrase matches when it contains only 
delimited or only non-delimited tokens, but stops matching as soon as 
it mixes the two.

Could someone explain? And is there a solution to make it work?

Best regards,

Jeroen



Re: Phrase search and WordDelimiterGraphFilter not working as expected with mixed delimited and non-delimited tokens

Posted by Michael Gibney <mi...@michaelgibney.net>.
There are many layers to this, but for the config you posted (applying
index-time WDGF configured to both split and catenate tokens), the
fundamental issue is that Lucene doesn't index positionLength, so the
graph structure (and token adjacency information) of the token stream
is lost when it's serialized to the index. Once the positionLength
information is discarded, it's impossible to restore/leverage it at
query time.

For now, if you use WDGF (or any analysis component capable of
generating "graph"-type output) at index-time, you'll have issues
unless you configure it such that it won't in practice generate graph
output. For WDGF this would mean either catenate output, or split
output, but not both on a single analysis chain. If you need both, one
option would be to index to (and search on) two fields: one for
catenated analysis, one for split analysis.
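
A rough, untested sketch of the two-field option follows. The names
text_split and text_cat are made up for illustration, and the exact
flag combinations are an assumption about what keeps each chain from
emitting overlapping tokens. It is worth checking both chains on the
Analysis screen before relying on them.

```xml
<!-- Hypothetical field type: split only. No catenation and no
     preserveOriginal, so each part gets its own position. -->
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="0"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Same chain at query time, minus FlattenGraphFilter -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field type: catenate only. Parts are not emitted,
     so "i.n.i.t." is expected to become the single token "init". -->
<fieldType name="text_cat" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="0"
            generateNumberParts="0"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            preserveOriginal="0"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="0"
            generateNumberParts="0"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="text_split" type="text_split" indexed="true" stored="false"/>
<field name="text_cat" type="text_cat" indexed="true" stored="false"/>

<!-- Feed both analysis variants from the original "text" field -->
<copyField source="text" dest="text_split"/>
<copyField source="text" dest="text_cat"/>
```

A phrase would then be searched on both fields, for example
text_split:"i.n.i.t. firstsirname" OR text_cat:"i.n.i.t. firstsirname",
or by listing both fields in qf with edismax.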

Graph output *is* respected at query-time, so you have more options
configuring WDGF on a query-time analyzer. But in that case, it's
worth being aware of the potential for exponential query expansion
(see discussion at https://issues.apache.org/jira/browse/SOLR-13336,
which restores a safety valve for extreme instances of this case).

Some other potentially relevant issues/links:
https://issues.apache.org/jira/browse/LUCENE-4312
https://issues.apache.org/jira/browse/LUCENE-7398
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
(Lucene, so applies also to Solr)
https://michaelgibney.net/lucene/graph/
