Posted to solr-user@lucene.apache.org by "Whelan, Andy" <aw...@srcinc.com> on 2016/10/03 18:51:51 UTC

Preceding special characters in ClassicTokenizerFactory

Hello,
I am guessing that what I am looking for will probably require extending StandardTokenizerFactory or ClassicTokenizerFactory, but I thought I would ask the group here before attempting this. We are indexing documents from an eclectic set of sources, with a heavy interest in computing and social media sources. So we would like computer terminology and social media terms (terms beginning with a hash (#), an @ symbol, etc.) to be searchable.

We are considering the ClassicTokenizerFactory because we like the fact that it does not use the Unicode Standard Annex UAX#29<http://unicode.org/reports/tr29/#Word_Boundaries> word boundary rules. It preserves email addresses, internet domain names, etc. We would also like to use it as the tokenizer element of index and query analyzers that would preserve @<rest of token> or #<rest of token> patterns.

I have seen examples where folks remedy such problems by replacing the StandardTokenizerFactory in their analyzer with chains made up of charFilters, WhitespaceTokenizerFactory, etc., as in the following article: http://www.prowave.io/indexing-special-terms-using-solr/

Example:
         <analyzer type="index">
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\.\s)" replacement=" " />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\.$)" replacement="" />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(,)" replacement=" " />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(;)" replacement=" " />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\|)" replacement=" " />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\/)" replacement=" " />
                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                 <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true" expand="false"/>
                 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
                 <filter class="solr.LowerCaseFilterFactory"/>
         </analyzer>


I am just wondering if anyone knows of a smart way (without extending classes) to preserve most of the ClassicTokenizerFactory functionality without getting rid of leading special characters? The "Solr In Action" book (page 179) claims that it is hard to extend the StandardTokenizerFactory. I'm assuming the same is true for ClassicTokenizerFactory.

Thanks
-Andrew


Re: Preceding special characters in ClassicTokenizerFactory

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Andy,

WordDelimiterFilter has a "types" option. There is an example file named wdftypes.txt in the source tree that preserves #hashtags and @mentions. If you follow this path, please use the whitespace tokenizer.
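
A minimal sketch of what that could look like in the schema, assuming a Solr 6.x-era WordDelimiterFilterFactory. The field type name text_social and the file name wdftypes-social.txt are hypothetical, and the flag values are only one plausible starting point, not a tested configuration. The type mappings belong in the separate types file; they are shown in a comment here so the whole idea fits in one place.

```xml
<!-- Hypothetical field type: whitespace tokenizer plus a word delimiter
     filter whose "types" file reclassifies # and @ as letters. -->
<fieldType name="text_social" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so leading # and @ reach the filter intact -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--
      Assumed contents of conf/wdftypes-social.txt (one mapping per line).
      A line starting with a bare # is a comment, so the hash is escaped:

        \# => ALPHA
        @ => ALPHA
    -->
    <filter class="solr.WordDelimiterFilterFactory"
            types="wdftypes-social.txt"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="0"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With these mappings, tokens such as #solr and @lucene should pass through as single terms, because # and @ are no longer treated as delimiters. Note that this does not reproduce ClassicTokenizerFactory's handling of email addresses and domain names: the filter would most likely still split such tokens at "." unless further mappings are added, so it is worth checking the result on the Analysis screen of the Solr admin UI before reindexing.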

Ahmet


