Posted to solr-user@lucene.apache.org by Chris Wareham <ch...@graduate-jobs.com> on 2019/01/28 11:02:24 UTC

PatternReplaceFilterFactory problem

I'm trying to index some data which often includes domain names. I'd
like to remove the .com TLD, so I have modified the text_en field type
by adding a PatternReplaceFilterFactory filter. However, it doesn't
appear to be working: a search for "text:(mydomain.com)" matches
records, but "text:(mydomain)" does not.

   <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
       <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="([-a-z])\.com" replacement="$1"/>
       <filter class="solr.EnglishPossessiveFilterFactory"/>
       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
       <filter class="solr.PorterStemFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
       <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="([-a-z])\.com" replacement="$1"/>
       <filter class="solr.EnglishPossessiveFilterFactory"/>
       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
       <filter class="solr.PorterStemFilterFactory"/>
     </analyzer>
   </fieldType>

The actual field definitions are as follows:

   <field name="companyName" type="text_en"      indexed="true" stored="true"  required="true"/>
   <field name="jobTitle"    type="text_en"      indexed="true" stored="true"  required="true"/>
   <field name="text"        type="text_general" indexed="true" stored="false"/>

   <copyField source="companyName" dest="text"/>
   <copyField source="jobTitle"    dest="text"/>

Re: PatternReplaceFilterFactory problem

Posted by Chris Wareham <ch...@graduate-jobs.com>.
Thanks for the help - changing the field type of the destination for the
copy fields to "text_en" solved the problem. I'd foolishly assumed that
the analysis of the source fields was applied first and the resulting
tokens were then passed to the copy field, which doesn't really make
sense now that I think about it!

So the indexing process is:

+-----------+     +----------------+     +-------------+
|companyName|     |  companyName   |     | companyName |
|input data |---->|text_en analysis|---->|    index    |
+-----------+     +----------------+     +-------------+
       |
       |           +----------------+     +-------------+
       +---------->|      text      |---->|    text     |
                   |text_en analysis|     |    index    |
                   +----------------+     +-------------+

Rather than:

+-----------+     +----------------+       +-------------+
|companyName|     |  companyName   |       | companyName |
|input data |---->|text_en analysis|------>|    index    |
+-----------+     +----------------+       +-------------+
                           |
                +---------------------+     +-------------+
                |         text        |---->|    text     |
                |text_general analysis|     |    index    |
                +---------------------+     +-------------+
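
For anyone following along, the change amounts to something like the
sketch below. It reuses the field definitions from the original post and
only switches the type of the "text" copy target; everything else is
assumed unchanged. Documents indexed before the change would need to be
reindexed for it to take effect.

```xml
<!-- Sketch only: same fields as in the original post, with the copy target
     now analysed as text_en so the PatternReplaceFilterFactory applies. -->
<field name="companyName" type="text_en" indexed="true" stored="true"  required="true"/>
<field name="jobTitle"    type="text_en" indexed="true" stored="true"  required="true"/>
<field name="text"        type="text_en" indexed="true" stored="false"/>

<copyField source="companyName" dest="text"/>
<copyField source="jobTitle"    dest="text"/>
```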


On 28/01/2019 12:37, Scott Stults wrote:
> Hi Chris,
> 
> You've included the field definition of type text_en, but in your queries
> you're searching the field "text", which is of type text_general. That may
> be the source of your problem, but if looking into that doesn't help, send
> the definition of text_general as well.
> 
> Hope that helps!
> 
> -Scott

Re: PatternReplaceFilterFactory problem

Posted by Scott Stults <ss...@opensourceconnections.com>.
Hi Chris,

You've included the field definition of type text_en, but in your queries
you're searching the field "text", which is of type text_general. That may
be the source of your problem, but if looking into that doesn't help, send
the definition of text_general as well.

Hope that helps!

-Scott


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com

Re: PatternReplaceFilterFactory problem

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
In Admin UI, there is an Analysis screen. You can enter your text and
your query there and see what happens to it at every step of the
processing pipeline.

This should tell you whether the problem is in indexing, query, or
somewhere else entirely (e.g. you are querying a different field as
Scott suggests).

Regards,
   Alex.
P.S. (Semi-)random tip of the day: if you copyField the content, it is
indexed and searched by the rules of the _target_ field. The source
field's own chain is invoked only when you search on that field directly.
