Posted to solr-user@lucene.apache.org by "Zimmermann, Thomas" <tz...@techtarget.com> on 2018/08/15 18:43:26 UTC

Is Running the Same Filters on Index and Query Redundant?

Hi,

We have the text field below configured on fields that are both stored and indexed. It seems to me that applying the same filters on both index and query would be redundant, and perhaps a waste of processing on the retrieval side if the filter work was already done on the index side. Is this a fair statement to make? Should I only be applying filters on one end of the transaction?

Thanks,
TZ


   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
   </fieldType>

Re: Is Running the Same Filters on Index and Query Redundant?

Posted by Erick Erickson <er...@gmail.com>.

Thomas:

If you go to the admin UI, pick a collection (or core), and open the
"analysis" page, you can put different values in the "index" and "query"
entry boxes and see what each analysis chain produces. Sometimes a
picture is worth a thousand words ;).

And, indeed, synonyms are one of the prime filters that often differ
between the two phases. Do be a little careful about
WordDelimiterGraphFilterFactory, too: it's configured subtly differently
in the examples, with catenateWords="1" catenateNumbers="1" at index
time and catenateWords="0" catenateNumbers="0" at query time. With
catenateWords="1", indexing wi-fi puts these tokens in the index:

wi
fi
wifi

At query time, with catenateWords="0", a search for wi-fi only produces

wi
fi

but that's OK, since those tokens are already in the index, as is
"wifi" if the search was for "wifi".

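As a rough sketch of the wi-fi case above, the two analyzers might be configured as follows. The field type name is hypothetical, and the chain is trimmed to the parts that matter here; the FlattenGraphFilterFactory after the index-time word delimiter is an addition of this sketch, following the Solr Reference Guide's advice that graph filters used at index time be followed by a flatten step.

```xml
<!-- Hypothetical field type; only the attributes quoted above are taken from the thread. -->
<fieldType name="text_wdgf_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenateWords="1": "wi-fi" contributes wi, fi, and wifi to the index -->
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <!-- The index cannot store a token graph, so flatten it before indexing -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenateWords="0": a query for "wi-fi" yields only wi and fi -->
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Pasting "wi-fi" into both boxes of the analysis page for a field type like this should show the extra catenated token on the index side only.
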
Do note that when the filters for query and index _are_ identical, say
something like WhitespaceTokenizer + LowerCaseFilter, you can indeed
define only one analyzer. Just leave off the "type" attribute: use a
single <analyzer> rather than separate <analyzer type="index"> and
<analyzer type="query"> elements.

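A minimal sketch of that single-analyzer form, assuming a whitespace tokenizer plus lowercasing is all the field needs (the field type name is made up for illustration):

```xml
<!-- Hypothetical field type name -->
<fieldType name="text_simple_sketch" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: this one chain is used for both indexing and querying -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```
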
Meanwhile Andrea has taken care of you I see...

Best,
Erick

Re: Is Running the Same Filters on Index and Query Redundant?

Posted by Andrea Gazzarini <a....@sease.io>.

You're welcome, great to hear you have fewer doubts.
I see you're using the SynonymGraphFilter followed by a StopFilter at 
query time: have a look at this post [1], you might find some useful info.

Best,
Andrea

[1] https://sease.io/2018/07/combining-synonyms-and-stopwords.html

Re: Is Running the Same Filters on Index and Query Redundant?

Posted by "Zimmermann, Thomas" <tz...@techtarget.com>.

Hi Andrea,

Thanks so much. I wasn't thinking about the query portion of the analyzer
from the correct perspective, but your explanation makes perfect sense. In
my head I imagined the result set of the query being transformed by the
filters, but in actuality the filters are applied to the query itself
before processing. This makes sense on my end and I think it answers my
questions.

Excellent point on the HTML strip factory. I'll evaluate our use cases.

This was all brought about by switching from the deprecated synonym and
word delimiter factories to the new graph based factories, where we
stopped filtering on insert for those and switched to filtering on query
based on recommendations from the Solr Doc.

Thanks,
TZ

Re: Is Running the Same Filters on Index and Query Redundant?

Posted by Andrea Gazzarini <a....@sease.io>.

Hi Thomas,
as you know, the two analyzers come into play at different moments, with
different inputs and a different goal for the corresponding output:

  * index analyzer: input is a field value, output is used for building
    the index
  * query analyzer: input is a (user) query string, output is used for
    building a (Solr) query

At index time a term dictionary is built, and at retrieval time the
query built from the query analyzer's output tries to find a match in
that dictionary. I wouldn't call it "redundancy" because even if the
filter is the same, it is applied to a different input and it has a
different goal.

Some filters must be present both at index and at query time, because
otherwise you won't find any match: if you put a lowercase filter only
on the index side, queries with uppercase chars won't find any match.
Others don't need to be (one example is the SynonymGraphFilter you've
used only at query time). In general, everything depends on your needs,
and it's perfectly valid to have either symmetric (index analyzer =
query analyzer) or asymmetric (index analyzer != query analyzer) text
analysis.
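
To make the distinction concrete, here is a small sketch (the field type name is hypothetical, and the chain is deliberately shorter than the one in the original post): the lowercase filter appears on both sides so that terms line up, while the synonym filter is query-only.

```xml
<!-- Hypothetical field type illustrating asymmetric analysis -->
<fieldType name="text_asymmetric_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Needed on both sides: "Solr" is indexed as "solr" -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Query-only: expands the user's terms, nothing to mirror at index time -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- Without this, a query for "Solr" would never match the indexed "solr" -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```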

Without knowing your context it is very hard to guess whether there's
something wrong in the configuration. Which part of the analyzers do
you think is redundant?

On top of that: the HTMLStripCharFilterFactory applied at query time in
your chain is unusual. It makes perfect sense at index time (where I
guess you index some HTML source), but at query time I can't imagine a
scenario where the user inputs queries containing HTML tags.
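
If it turns out that queries never contain markup, one possible adjustment is sketched below. It is the chain from the original post with only the query-side charFilter dropped; whether that is safe depends on the use cases, so treat it as an assumption to verify on the analysis page rather than a recommendation.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Kept: indexed field values may contain HTML -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Dropped here (assumption): user queries are plain text, so there is nothing to strip -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```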

Best,
Andrea
