Posted to solr-user@lucene.apache.org by TK Solr <tk...@sonic.net> on 2020/04/08 21:49:01 UTC

ReversedWildcardFilter - should it be applied only at the index time?

In the usage example shown for ReversedWildcardFilter
<https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#reversed-wildcard-filter>
in the Solr Ref Guide, and in the only usage I could find in managed-schema
(the definition of text_general_rev), the filter is used only for indexing.

   <fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="2"
               maxFractionAsterisk="0.33" maxPosAsterisk="3" withOriginal="true"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.SynonymGraphFilterFactory" expand="true"
               ignoreCase="true" synonyms="synonyms.txt"/>
       <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>


Is it incorrect to use the same analyzer for both indexing and querying, like this?

   <fieldType name="lowercase_rev" class="solr.TextField" positionIncrementGap="100">
     <!-- Added to handle right-anchored substring match for email fields -->
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="0"
               maxFractionAsterisk="0" maxPosAsterisk="100" withOriginal="false"/>
     </analyzer>
   </fieldType>

In the description of the filter, I see "Tokens without wildcards are not reversed."
But the wildcard appears only in the query string. How can
ReversedWildcardFilter know whether a wildcard is being used
if the filter is applied only at index time?

TK

Re: ReversedWildcardFilter - should it be applied only at the index time?

Posted by TK Solr <tk...@sonic.net>.

It doesn't tell me much:

"debug":{
  "rawquerystring":"email:*@aol.com",
  "querystring":"email:*@aol.com",
  "parsedquery":"(email:*@aol.com)",
  "parsedquery_toString":"email:*@aol.com",
  "explain":{
    "11d6e092-58b5-4c1b-83bc-f3b37e0797fd":{
      "match":true,
      "value":1.0,
      "description":"email:*@aol.com"},

The email field uses ReversedWildcardFilter for both indexing and querying.

On 4/15/20 12:04 PM, Erick Erickson wrote:
> What do you see if you add &debug=query? That should tell you….
>
> Best,
> Erick

Re: ReversedWildcardFilter - should it be applied only at the index time?

Posted by Erick Erickson <er...@gmail.com>.

What do you see if you add &debug=query? That should tell you….

Best,
Erick

> On Apr 15, 2020, at 2:40 PM, TK Solr <tk...@sonic.net> wrote:
> 
> Thank you.
> 
> Is there any harm if I use it on the query side too? In my case it seems to work OK (even with withOriginal="false"), and is even faster.
> I see that the query parser code looks at the index analyzer and applies ReversedWildcardFilter at query time. But I didn't
> quite understand what happens if the query analyzer also uses ReversedWildcardFilter.

Re: ReversedWildcardFilter - should it be applied only at the index time?

Posted by TK Solr <tk...@sonic.net>.

Thank you.

Is there any harm if I use it on the query side too? In my case it seems to work
OK (even with withOriginal="false"), and is even faster.
I see that the query parser code looks at the index analyzer and applies
ReversedWildcardFilter at query time. But I didn't
quite understand what happens if the query analyzer also uses
ReversedWildcardFilter.
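
A rough way to picture the query-time rewrite mentioned above, assuming the index holds each token reversed with a leading marker character (U+0001 is assumed here) and withOriginal="false": the leading-wildcard pattern is reversed as well, so the wildcard ends up at the end of the term. This is a toy Python model built on fnmatch, not Solr's query parser code; the function names and sample addresses are made up for illustration.

```python
from fnmatch import fnmatchcase

MARKER = "\u0001"  # assumed marker character prepended to reversed terms


def index_term(token):
    """Model of index time with withOriginal="false": only the reversed form is kept."""
    return MARKER + token[::-1]


def rewrite_leading_wildcard(pattern):
    """Model of query time: reverse the whole pattern and prepend the marker."""
    return MARKER + pattern[::-1]


# Hypothetical indexed values for an email field.
indexed = [index_term(t) for t in ["joe@aol.com", "ann@example.org"]]

# "*@aol.com" becomes MARKER + "moc.loa@*", i.e. a trailing wildcard.
query = rewrite_leading_wildcard("*@aol.com")
matches = [term for term in indexed if fnmatchcase(term, query)]

print(repr(query))
print(matches)
```

Once reversed, the pattern has a fixed prefix, so the lookup behaves like a prefix match on the reversed terms rather than a scan for a leading wildcard.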

On 4/15/20 1:51 AM, Colvin Cowie wrote:
> You only need to apply it in the index analyzer:
> https://lucene.apache.org/solr/8_4_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html
> If it appears in the index analyzer, the query part of it is automatically
> applied at query time.
>
> The ReversedWildcardFilter indexes *every* token in reverse, with a special
> character at the start ('\u0001' I believe) to avoid false positive matches
> when the query term isn't reversed (e.g. if the term being indexed is mar,
> then the reversed token would be \u0001ram, so a search for 'ram' wouldn't
> accidentally match that). If *withOriginal* is set to true then it will
> index the normal token as well as the reversed token.

Re: ReversedWildcardFilter - should it be applied only at the index time?

Posted by Colvin Cowie <co...@gmail.com>.

You only need to apply it in the index analyzer:
https://lucene.apache.org/solr/8_4_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html
If it appears in the index analyzer, the query part of it is automatically
applied at query time.

The ReversedWildcardFilter indexes *every* token in reverse, with a special
character at the start ('\u0001' I believe) to avoid false positive matches
when the query term isn't reversed (e.g. if the term being indexed is mar,
then the reversed token would be \u0001ram, so a search for 'ram' wouldn't
accidentally match that). If *withOriginal* is set to true then it will
index the normal token as well as the reversed token.
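
A minimal sketch of the index-time behaviour described above, written in plain Python rather than against the Lucene API. The marker value U+0001 follows the "I believe" above and is an assumption, as are the function and variable names.

```python
MARKER = "\u0001"  # assumed marker character, per the description above


def emitted_terms(token, with_original):
    """Return the set of terms a reversing filter would index for one token."""
    terms = {MARKER + token[::-1]}  # reversed token, marker first
    if with_original:
        terms.add(token)  # the unreversed token is indexed as well
    return terms


only_reversed = emitted_terms("mar", with_original=False)
both = emitted_terms("mar", with_original=True)

# A plain search for "ram" cannot hit the reversed form of "mar",
# because the indexed term is MARKER + "ram", not "ram".
print("ram" in only_reversed)
print("mar" in both)
```

With with_original=False only the marked, reversed term exists in this model, which is why the marker matters for avoiding accidental matches.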


On Thu, 9 Apr 2020 at 02:27, TK Solr <tk...@sonic.net> wrote:

> I experimented with index-time-only use of ReversedWildcardFilter and with
> use at both index and query time.
>
> My results show that using ReversedWildcardFilter at both times runs twice
> as fast, but my dataset is not very large (on the order of 10k docs), so
> I'm not sure I can draw a conclusion.

Re: ReversedWildcardFilter - should it be applied only at the index time?

Posted by TK Solr <tk...@sonic.net>.

I experimented with index-time-only use of ReversedWildcardFilter and with
use at both index and query time.

My results show that using ReversedWildcardFilter at both times runs twice as
fast, but my dataset is not very large (on the order of 10k docs), so I'm not
sure I can draw a conclusion.
