Posted to solr-user@lucene.apache.org by Bruno Mannina <bm...@matheo-software.com> on 2019/01/11 09:18:29 UTC

Schema.xml, copyField, Slash, ignoreCase ?

Hello,

I’m facing a problem with the default field “text” (Solr 5.4) and
queries that contain a / (slash).

I need the default “text” field to:

- ignore case,
- apply no automatic truncation,
- handle the slash character.

I would like to query only the field “text”.
Queries can contain a code, keywords, or both.

I have 2 fields named symbol and title, and 1 alias ti (an old field that I
can’t delete or modify).

* Symbol contains a code with a slash (e.g. A62C21/02)

<field name="symbol" type="string_ci" multiValued="false" indexed="true"
       required="true" stored="true"/>

* Title contains English text and may also contain symbols

<field name="title" type="text_en" multiValued="true" indexed="true"
       stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

{ "symbol": "B65D81/20",
  "title": [
    "under vacuum or superatmospheric pressure, or in a special atmosphere, e.g. of inert gas  {(B65D81/28  takes precedence; containers with pressurising means for maintaining ball pressure A63B39/025)} "
  ]}

* Ti is an alias of title

<field name="ti" type="text_general" multiValued="true" indexed="true"
       stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

* Text is

<field name="text" type="text_general" indexed="true" stored="false"
       multiValued="true"/>

- The aliases are:

<copyField source="title"  dest="ti"/>
<!-- ALIAS TEXT -->
<copyField source="title"  dest="text"/>
<copyField source="symbol" dest="text"/>

If I run these queries:

* ti:airbag            -> OK
* title:airbag         -> not good for me, because it also finds "airbags"
* ti:b65D81/28         -> not good, debug shows ti:b65d81 OR ti:28
* ti:"b65D81/28"       -> OK
* symbol:b65D81/28     -> OK (even without quotes)

Now with the “text” field:

* b65D81/28            -> not good, debug shows text:b65d81 OR text:28
* airbag               -> OK
* "b65D81/28"          -> OK

It would be great if I could enter a symbol without quotes.

Could you help me define a text field that solves this problem? (All my
field definitions are below.)

Many thanks for your help.

 

string_ci is my own definition:

<fieldType name="string_ci" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

 

 

Best Regards

Bruno

 




RE: Schema.xml, copyField, Slash, ignoreCase ?

Posted by Bruno Mannina <bm...@free.fr>.
Hi Erick,

Thanks for the tip about the Admin UI >> (core) >> Analysis page; I will investigate this afternoon.

Regards,
Bruno

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Friday, 11 January 2019 17:18
To: solr-user
Subject: Re: Schema.xml, copyField, Slash, ignoreCase ?

The admin UI>>(select a core)>>analysis page is your friend here. It'll show you exactly what each filter in your analysis chain does and from there you'll need to mix and match filters, your tokenizer and the like to support the use-cases you need.

My guess is that the field type you're using contains WordDelimiterFilterFactory, which is splitting on the slash.
Similarly, for your airbag/airbags problem, you probably have one of the stemmers in your analysis chain.

See "Filter Descriptions" in your version of the ref guide.

And one caution: The admin>>core>>analysis chain shows you what happens _after_ query parsing. So if you enter (without quotes) "bing bong" those tokens will be shown. What fools people is that the query _parser_ gets in there first, so they'll then wonder why field:bing bong doesn't work. It's because the parser made it into field:bing default_field:bong. So you'll still (potentially) have to quote or escape some terms on input; it depends on the query parser you're using.

Best,
Erick


Re: Schema.xml, copyField, Slash, ignoreCase ?

Posted by Erick Erickson <er...@gmail.com>.
The admin UI>>(select a core)>>analysis page is your friend here. It'll
show you exactly what each filter in your analysis chain does and from
there you'll need to mix and match filters, your tokenizer and the like
to support the use-cases you need.

My guess is that the field type you're using contains
WordDelimiterFilterFactory, which is splitting on the slash.
Similarly, for your airbag/airbags problem, you probably have
one of the stemmers in your analysis chain.

See "Filter Descriptions" in your version of the ref guide.

And one caution: The admin>>core>>analysis chain
shows you what happens _after_ query parsing. So if
you enter (without quotes) "bing bong" those tokens
will be shown. What fools people is that the query _parser_
gets in there first, so they'll then wonder why
field:bing bong
doesn't work. It's because the parser made it into
field:bing default_field:bong. So you'll still (potentially)
have to quote or escape some terms on input; it depends
on the query parser you're using.
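
As a rough illustration of where that default field comes from (a sketch,
not taken from your actual setup): with the standard query parser, terms
without a field prefix go to the field named by the df parameter, which is
often set as a handler default in solrconfig.xml. The handler name /select
and the value text below are assumptions based on the field names in this
thread.

```xml
<!-- solrconfig.xml fragment (assumed layout): "df" names the default field
     that the standard query parser uses for terms without a field prefix. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- With df=text, the input  ti:bing bong  is parsed as
         ti:bing text:bong  before any field analysis runs. -->
    <str name="df">text</str>
  </lst>
</requestHandler>
```

Quoting the input, as in ti:"bing bong", keeps both words on the ti field
as a phrase.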

Best,
Erick


RE: Schema.xml, copyField, Slash, ignoreCase ?

Posted by Bruno Mannina <bm...@free.fr>.
Hi Steve,

Many thanks for this field type; I will test it this afternoon on my dev server.

Thanks also for your explanation !

Have a nice day !

Bruno

-----Original Message-----
From: Steve Rowe [mailto:sarowe@gmail.com]
Sent: Friday, 11 January 2019 17:43
To: solr-user@lucene.apache.org
Subject: Re: Schema.xml, copyField, Slash, ignoreCase ?

Hi Bruno,

ignoreCase: Looks like you already have achieved this?

auto truncation: This is caused by inclusion of PorterStemFilterFactory in your "text_en" field type.  If you don't want its effects (i.e. treating different forms of the same word interchangeably), remove the filter.

process slash char: I think you want the slash to be included in symbol terms rather than interpreted as a term separator.  One way to achieve this is to first, pre-tokenization, convert the slash to a string that does not include a term separator, and then post-tokenization, convert the substituted string back to a slash.

Here's a version of your text_en that uses PatternReplaceCharFilterFactory[1] to convert slashes inside of symbol-ish terms (the pattern is a guess based on the symbol text you've provided; you'll likely need to adjust it) to "_": a string unlikely to otherwise occur, and which will not be interpreted by StandardTokenizer as a term separator; and then PatternReplaceFilterFactory[1] to convert "_" back to slashes.  Note that the patterns for the two are slightly different, since the *char filter* is given as input the entire field text, while the *filter* is given the text of single terms.

-----
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b" 
                replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" 
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$" 
            replacement="$1/$2"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b" replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" 
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$" 
            replacement="$1/$2"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
-----

[1] http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.4.pdf

--
Steve



Re: Schema.xml, copyField, Slash, ignoreCase ?

Posted by Steve Rowe <sa...@gmail.com>.
Hi Bruno,

ignoreCase: Looks like you already have achieved this?

auto truncation: This is caused by inclusion of PorterStemFilterFactory in your "text_en" field type.  If you don't want its effects (i.e. treating different forms of the same word interchangeably), remove the filter.

process slash char: I think you want the slash to be included in symbol terms rather than interpreted as a term separator.  One way to achieve this is to first, pre-tokenization, convert the slash to a string that does not include a term separator, and then post-tokenization, convert the substituted string back to a slash.

Here's a version of your text_en that uses PatternReplaceCharFilterFactory[1] to convert slashes inside of symbol-ish terms (the pattern is a guess based on the symbol text you've provided; you'll likely need to adjust it) to "_": a string unlikely to otherwise occur, and which will not be interpreted by StandardTokenizer as a term separator; and then PatternReplaceFilterFactory[1] to convert "_" back to slashes.  Note that the patterns for the two are slightly different, since the *char filter* is given as input the entire field text, while the *filter* is given the text of single terms.

----- 
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b" 
                replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" 
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$" 
            replacement="$1/$2"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b" replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" 
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$" 
            replacement="$1/$2"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
-----
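
Since your "text" and "ti" fields use text_general rather than text_en, the same substitution could probably be carried over to that type as well. What follows is an untested sketch under that assumption. It reuses the two patterns above unchanged and keeps the filters from your text_general definition in their original order. The type name text_general_symbol is hypothetical; a separate type is used so that "ti", which you said you can't modify, keeps its current analysis. text_general has no stemmer, so nothing needs to be removed for the "no auto truncation" requirement.

```xml
<!-- Hypothetical type name: a copy of text_general plus the slash handling. -->
<fieldType name="text_general_symbol" class="solr.TextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <!-- Before tokenizing: B65D81/28 becomes B65D81_28 -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b"
                replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- After tokenizing: the single term B65D81_28 goes back to B65D81/28 -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$"
            replacement="$1/$2"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same substitution at query time, so unquoted b65D81/28 stays one term -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b"
                replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$"
            replacement="$1/$2"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The "text" field would then point at the new type. -->
<field name="text" type="text_general_symbol" indexed="true" stored="false"
       multiValued="true"/>
```

Because copyField passes the original source text to the destination, the symbol and title values copied into "text" would go through this chain. You would need to reindex after changing the index-time analyzer, and the Analysis page should let you check that b65D81/28 comes out as a single lowercased term.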

[1] http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.4.pdf

--
Steve

