
Posted to solr-user@lucene.apache.org by Vincenzo D'Amore <v....@gmail.com> on 2015/10/09 16:50:16 UTC

schema.xml field configuration

Hi,

I have this fieldType configuration:

<fieldType name="cod_parts" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="[-/\@]"
            replacement=" " />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1"
            catenateNumbers="0" catenateAll="1" splitOnCaseChange="0"
            splitOnNumerics="1" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>

Using the Solr Field Analysis tool on the string "0000aaa", the last step
of the chain shows this:

text     | 0000aaa | 0000 | 0000aaa | aaa
position | 1       | 1    | 1       | 2
start    | 0       | 0    | 0       | 4
end      | 8       | 4    | 7       | 7
type     | word    | word | word    | word


Now I'm quite surprised to see two occurrences of "0000aaa".
Why? I suppose it has something to do with the position, but I
don't understand what.
Shouldn't RemoveDuplicatesTokenFilterFactory remove all the duplicates?


-- 
Vincenzo D'Amore
email: v.damore@gmail.com
skype: free.dev
mobile: +39 349 8513251

Re: schema.xml field configuration

Posted by Erick Erickson <er...@gmail.com>.

Seems odd to me as well. I suspect you can work around
this by setting either catenateAll="0" or preserveOriginal="0".
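
As a minimal sketch of that workaround, assuming the rest of the analyzer
chain stays exactly as posted, the WordDelimiterFilterFactory entry might
look like this with catenateAll switched off. This is untested against the
index in question, and only one of the two attributes should need to change:

```xml
<!-- Assumption: identical to the posted filter except catenateAll="0". -->
<!-- The alternative is to leave catenateAll="1" and set preserveOriginal="0". -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
        splitOnNumerics="1" preserveOriginal="1" />
```

After reloading the core, re-running "0000aaa" through the Field Analysis
tool should show whether the second "0000aaa" token is gone.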

Best,
Erick
