Posted to solr-user@lucene.apache.org by Modassar Ather <mo...@gmail.com> on 2016/08/02 04:00:55 UTC
Regarding HTMLStripCharFilter.
Hi,
Kindly help me understand the way HTMLStripCharFilter works.
I have the following analysis chain.
int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
        | WordDelimiterFilter.GENERATE_NUMBER_PARTS
        | WordDelimiterFilter.CATENATE_WORDS
        | WordDelimiterFilter.CATENATE_NUMBERS
        | WordDelimiterFilter.CATENATE_ALL
        | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
        | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE
        | WordDelimiterFilter.PRESERVE_ORIGINAL;

@Override
protected Reader initReader(String field, Reader reader) {
    return new HTMLStripCharFilter(reader);
}

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream wordDMTStream = new WordDelimiterFilter(source, flags, null);
    TokenStream rdtStream = new RemoveDuplicatesTokenFilter(wordDMTStream);
    return new TokenStreamComponents(source, rdtStream);
}
*teRm<sub>3</sub>* returns the following analyzed tokens from the above
analysis chain.

*Text    Position Increment    Position Length    Offset attribute*
teRm3    1                     1                  0, 16
Rm3      1                     1                  0, 16
te       0                     1                  0, 16
teRm3    0                     1                  0, 16
In the above table teRm3 occurs twice, yet it is not removed by
RemoveDuplicatesTokenFilter.
Whereas plain *teRm3* (without the tags) gets tokenized by the same analysis
chain as below.
*Text    Position Increment    Position Length    Offset attribute*
teRm3    1                     1                  0, 5
te       0                     1                  0, 2
Rm3      1                     1                  2, 5
In this table the duplicate *teRm3* was removed by
RemoveDuplicatesTokenFilter, so it appears only once.
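For what it's worth, the difference appears consistent with RemoveDuplicatesTokenFilter's documented behavior: it drops a token only when its position increment is 0 and the same term has already been seen at the current position; a repeated term at a later position is kept. The sketch below is plain Java, not Lucene code — the Tok class and dedupe method are hypothetical stand-ins for the filter's rule, and the plain-text input order is my assumption about WordDelimiterFilter's pre-filter output — replaying the two token streams from the tables above under that rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Lucene source): RemoveDuplicatesTokenFilter drops
// a term only when its position increment is 0 and the same term was already
// seen since the last position advance.
public class DedupSketch {

    // A (term, positionIncrement) pair standing in for a Lucene token.
    public static final class Tok {
        final String term;
        final int posInc;
        public Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    public static List<String> dedupe(List<Tok> in) {
        List<String> out = new ArrayList<>();
        List<String> seenAtPos = new ArrayList<>(); // terms at current position
        for (Tok t : in) {
            if (t.posInc > 0) {
                seenAtPos.clear();                  // position advanced: reset
            }
            if (t.posInc == 0 && seenAtPos.contains(t.term)) {
                continue;                           // same-position duplicate
            }
            seenAtPos.add(t.term);
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // HTML-stripped stream from the first table: Rm3 advances the
        // position between the two teRm3 tokens, so the second teRm3 is not
        // a same-position duplicate and both are kept.
        List<Tok> html = List.of(new Tok("teRm3", 1), new Tok("Rm3", 1),
                                 new Tok("te", 0), new Tok("teRm3", 0));
        System.out.println(dedupe(html)); // [teRm3, Rm3, te, teRm3]

        // Assumed plain-text stream before dedup: the catenated teRm3 shares
        // a position with the original, so it is dropped.
        List<Tok> plain = List.of(new Tok("teRm3", 1), new Tok("te", 0),
                                  new Tok("teRm3", 0), new Tok("Rm3", 1));
        System.out.println(dedupe(plain)); // [teRm3, te, Rm3]
    }
}
```

Under that rule the second teRm3 in the HTML-stripped stream survives only because the intervening Rm3 (position increment 1) advanced the position, so the two teRm3 tokens are no longer at the same position.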
Please share your comments on this difference in analysis behavior.
Thanks,
Modassar
Re: Regarding HTMLStripCharFilter.
Posted by Modassar Ather <mo...@gmail.com>.
Hi,
Please provide your inputs.
Thanks,
Modassar