Posted to solr-user@lucene.apache.org by Modassar Ather <mo...@gmail.com> on 2016/08/02 04:00:55 UTC

Regarding HTMLStripCharFilter.

Hi,

Kindly help me understand the way HTMLStripCharFilter works.

I have the following analysis chain.

    // Flags controlling how WordDelimiterFilter splits and recombines tokens.
    int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
            | WordDelimiterFilter.GENERATE_NUMBER_PARTS
            | WordDelimiterFilter.CATENATE_WORDS
            | WordDelimiterFilter.CATENATE_NUMBERS
            | WordDelimiterFilter.CATENATE_ALL
            | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
            | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE
            | WordDelimiterFilter.PRESERVE_ORIGINAL;

    // Strip HTML markup from the raw input before it reaches the tokenizer.
    @Override
    protected Reader initReader(String field, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }

    // Whitespace tokenizer -> WordDelimiterFilter -> RemoveDuplicatesTokenFilter.
    @Override
    protected TokenStreamComponents createComponents(String arg0) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream wordDMTStrem = new WordDelimiterFilter(source, flags, null);
        TokenStream rdtStream = new RemoveDuplicatesTokenFilter(wordDMTStrem);

        return new TokenStreamComponents(source, rdtStream);
    }

The input *teRm<sub>3</sub>* is analyzed into the following tokens by the
above analysis chain.

*Text     Position Increment    Position Length    Offset attribute*
teRm3     1                     1                  0, 16
Rm3       1                     1                  0, 16
te        0                     1                  0, 16
teRm3     0                     1                  0, 16

In the above table teRm3 occurs twice, but the second occurrence is not
removed by RemoveDuplicatesTokenFilter.

In contrast, *teRm3* (without the markup) is tokenized by the same analysis
chain as follows.

*Text     Position Increment    Position Length    Offset attribute*
teRm3     1                     1                  0, 5
te        0                     1                  0, 2
Rm3       1                     1                  2, 5

In the above table the duplicate *teRm3* was removed by
RemoveDuplicatesTokenFilter, so it appears only once.
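
One way to compare the two tables is to turn the position increments into
absolute positions. As far as I understand, RemoveDuplicatesTokenFilter only
drops a token when the same text has already been emitted at the same
position. The plain-Java sketch below is a simplified model of that rule, not
Lucene's implementation. The class name SamePositionDedup and the unfiltered
token stream assumed for the second case are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model (NOT Lucene code): a token is dropped only when the same
// text was already seen at the same absolute position.
class SamePositionDedup {

    // terms[i] is the token text, posIncs[i] its position increment.
    // Returns the kept tokens formatted as "text@absolutePosition".
    static List<String> dedup(String[] terms, int[] posIncs) {
        List<String> kept = new ArrayList<>();
        Set<String> seenAtPosition = new HashSet<>();
        int position = 0;
        for (int i = 0; i < terms.length; i++) {
            if (posIncs[i] > 0) {
                // A positive increment starts a new position, so the
                // "already seen" set starts over.
                position += posIncs[i];
                seenAtPosition.clear();
            }
            if (seenAtPosition.add(terms[i])) {
                kept.add(terms[i] + "@" + position);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // First table: increments 1, 1, 0, 0 as reported for teRm<sub>3</sub>.
        System.out.println(dedup(
                new String[] {"teRm3", "Rm3", "te", "teRm3"},
                new int[] {1, 1, 0, 0}));
        // prints [teRm3@1, Rm3@2, te@2, teRm3@2]

        // Second case: an ASSUMED unfiltered stream for plain teRm3, with
        // the duplicate at the same position as the original token.
        System.out.println(dedup(
                new String[] {"teRm3", "te", "teRm3", "Rm3"},
                new int[] {1, 0, 0, 1}));
        // prints [teRm3@1, te@1, Rm3@2]
    }
}
```

Under this model the second teRm3 in the first table sits at position 2 while
the first one sits at position 1, so it would not count as a duplicate. In the
second case both would share position 1 and one of them would be dropped.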

Please share your comments on this difference in analysis behavior.

Thanks,
Modassar

Re: Regarding HTMLStripCharFilter.

Posted by Modassar Ather <mo...@gmail.com>.
Hi,

Could anyone please share their input on this?

Thanks,
Modassar
