Posted to users@solr.apache.org by gnandre <ar...@gmail.com> on 2021/04/21 23:33:58 UTC

WordDelimiter does not generate expected token

Hi,

I have a field value bim.ClassUnderlying, and the search query
classunderlying does not return any results. If I search for
classUnderlying, it works. What can I change so that the
classunderlying query works too? If I change splitOnCaseChange from 1 to 0
in the index-time analyzer chain, it works, but I don't want to do that
because I also want to extract the class and underlying tokens from
classUnderlying.

Following is my field type definition.

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" protected="protect.txt"
            preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_en.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" protected="protect.txt"
            preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_en_query.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Re: WordDelimiter does not generate expected token

Posted by Michael Gibney <mi...@michaelgibney.net>.

WDGF with both "generate*Parts"/"splitOn*" _and_
"catenate*"/"preserveOriginal" generates a graph TokenStream structure that
relies on PositionLengthAttribute to accurately reflect the graph
structure. Because Lucene does not index PositionLengthAttribute, this
information is lost when WDGF is used at index-time (resulting in the kind
of strange searching behavior you're observing). As a workaround, I would
recommend indexing into (and searching against) two fields: one with
index-time WDGF applying only "split"-type manipulations, one with
index-time WDGF applying only "catenate"-style operations. Another
alternative (making different compromises) would be to increase query-time
`ps` (phrase slop) to a value large enough to accommodate the "graph edges"
omitted from the Lucene index.
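
For illustration, the two-field workaround might look roughly like the sketch below. This is only an outline under stated assumptions: the type and field names (text_en_split, text_en_cat, body, body_split, body_cat) are hypothetical, and the protected-words, ICU, synonym, and stemming filters from the original chain are left out to keep it short.

```xml
<!-- Sketch only: all type and field names here are hypothetical. -->

<!-- "split"-type manipulations only: word and number parts, no catenation -->
<fieldType name="text_en_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            preserveOriginal="0" generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            preserveOriginal="0" generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- "catenate"-style operations only: original token plus catenated parts -->
<fieldType name="text_en_cat" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            preserveOriginal="1" generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            preserveOriginal="1" generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical fields: the source field "body" is copied into both variants -->
<field name="body_split" type="text_en_split" indexed="true" stored="false"/>
<field name="body_cat" type="text_en_cat" indexed="true" stored="false"/>
<copyField source="body" dest="body_split"/>
<copyField source="body" dest="body_cat"/>
```

Queries would then target both fields, for example by listing body_split and body_cat in `qf` with edismax. Which tokens each chain actually emits for a value like bim.ClassUnderlying is worth checking on the Analysis screen in the Solr Admin UI before reindexing, since that determines whether a query such as classunderlying can match.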

(Note that if you have multi-term synonyms at index-time, analogous issues
apply).

Some further relevant issues/blog posts:
https://issues.apache.org/jira/browse/LUCENE-4312
https://issues.apache.org/jira/browse/LUCENE-7398

https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/
https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
https://michaelgibney.net/lucene/graph/

