Posted to java-user@lucene.apache.org by ra...@barclays.com on 2013/12/05 18:05:01 UTC

Custom Tokenizer

Hi,

I have used StandardAnalyzer in my code and it is working fine. One challenge I face is that this analyzer, by default, tokenizes on some special characters such as the hyphen, in addition to the SPACE character.

I want to tokenize only on the SPACE character. Could you please suggest how I can achieve this?

I got this example when I googled for it. What I want to use is the WhitespaceTokenizer so that the data is not manipulated in any way. I understand that in this case, searches such as "mechanisms" won't return results because of the period (.) at the end. I then want to address this by introducing wildcard searches.

Data: 1097-0215 (i.v) product-123 anti-virus, we investigated the mechanisms. 2266-73 In the present study
Tokens generated with StandardTokenizer:
[1097-0215] [i.v] [product-123] [anti] [virus] [we] [investigated] [the] [mechanisms] [2266-73] [In] [the] [present] [study]
Tokens generated with WhitespaceTokenizer:
[1097-0215] [(i.v)] [product-123] [anti-virus,] [we] [investigated] [the] [mechanisms.] [2266-73] [In] [the] [present] [study]
Note: I have tried using the WhitespaceAnalyzer, which by default tokenizes ONLY on the space, but my attempt at performing wildcard searches didn't work as expected, whereas wildcard searches worked fine with StandardAnalyzer.

Please provide your inputs.

Regards,
Raghu


_______________________________________________

This message is for information purposes only, it is not a recommendation, advice, offer or solicitation to buy or sell a product or service nor an official confirmation of any transaction. It is directed at persons who are professionals and is not intended for retail customer use. Intended for recipient only. This message is subject to the terms at: www.barclays.com/emaildisclaimer.

For important disclosures, please see: www.barclays.com/salesandtradingdisclaimer regarding market commentary from Barclays Sales and/or Trading, who are active market participants; and in respect of Barclays Research, including disclosures relating to specific issuers, please see http://publicresearch.barclays.com.

_______________________________________________

Re: Custom Tokenizer

Posted by Erick Erickson <er...@gmail.com>.
You can also string together any of a myriad of TokenFilters; see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

I'd recommend spending some time on the admin/analysis page
to understand what all the combinations do. I'd also recommend
against dealing with punctuation etc. by using wildcards. When
you use wildcards, the terms matched don't contribute to the
relevance score.

For instance, LowerCaseTokenizerFactory will tokenize all
letter sequences, lowercase them, and drop all non-letters.
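
To get a feel for what "letter sequences only" does to the sample data
from the question, here is a rough plain-Java sketch. It is not Lucene or
Solr code; the class name LetterLowerCaseSketch is made up for this
illustration, and it only imitates the letter-run-plus-lowercase behaviour
with java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: imitates "keep runs of letters, lowercase them".
class LetterLowerCaseSketch {

    // One token per maximal run of Unicode letters; everything else is a separator.
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LETTERS.matcher(text);
        while (m.find()) {
            // Lowercase each letter run; digits and punctuation never reach a token.
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String data = "1097-0215 (i.v) product-123 anti-virus, we investigated "
                + "the mechanisms. 2266-73 In the present study";
        System.out.println(tokenize(data));
    }
}
```

With the sample data, purely numeric tokens such as 1097-0215 disappear
and product-123 shrinks to "product", so this choice only fits if the
numbers do not need to be searchable.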

PatternReplaceFilterFactory will allow you to define with
regexes what you want to be included in your tokens etc. You
could use this in conjunction with WhitespaceTokenizerFactory
for instance.
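
As a rough idea of what the whitespace-plus-pattern-replace combination
does, here is a plain-Java sketch. It is not Lucene or Solr code: the
class name WhitespaceThenPatternReplace and the regex are assumptions
chosen for this example, and unlike a real token filter the sketch simply
drops tokens that become empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: split on whitespace, then clean each token with a regex.
class WhitespaceThenPatternReplace {

    // Assumed rule: strip runs of non-letter, non-digit characters at either end of a token.
    private static final Pattern EDGE_PUNCT =
            Pattern.compile("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Step 1: split on whitespace only, so hyphens and dots inside a token survive.
        for (String raw : text.trim().split("\\s+")) {
            // Step 2: per-token replacement, applied after tokenization.
            String cleaned = EDGE_PUNCT.matcher(raw).replaceAll("");
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String data = "1097-0215 (i.v) product-123 anti-virus, we investigated "
                + "the mechanisms. 2266-73 In the present study";
        System.out.println(tokenize(data));
    }
}
```

For the sample data this keeps 1097-0215, product-123 and anti-virus
whole while turning (i.v) into i.v and mechanisms. into mechanisms, so a
plain term search for "mechanisms" would match without any wildcard.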

Or as Furkan suggests, use PatternReplaceCharFilterFactory
to operate on the entire input before it's broken up by
whatever tokenizer you use. Or....
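
The char-filter variant differs only in where the regex runs: on the
whole input string, before any tokenizer sees it. A plain-Java sketch of
that ordering follows. Again, this is not Lucene or Solr code; the class
name CharFilterThenWhitespace and the pattern are assumptions made up for
the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: rewrite the raw text first, tokenize second.
class CharFilterThenWhitespace {

    // Assumed rule: remove punctuation runs that touch whitespace or the start/end
    // of the input, leaving punctuation inside a word (hyphens, inner dots) alone.
    private static final Pattern EDGE_PUNCT = Pattern.compile(
            "(?<!\\S)[^\\p{L}\\p{N}\\s]+|[^\\p{L}\\p{N}\\s]+(?!\\S)");

    // Stage 1: operates on the entire input, before it is broken into tokens.
    static String charFilter(String text) {
        return EDGE_PUNCT.matcher(text).replaceAll("");
    }

    // Stage 2: whitespace-only tokenization of the already cleaned text.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : charFilter(text).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(i.v) anti-virus, we investigated the mechanisms."));
    }
}
```

Because the replacement happens before tokenization, the same cleanup
works no matter which tokenizer follows it.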

You _really_ should make the effort to define a proper
analysis chain rather than just use wildcards IMO.

Best,
Erick



Re: Custom Tokenizer

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi;

StandardAnalyzer includes these filters by default:

StandardFilter, LowerCaseFilter and StopFilter

You could consider char filters. Have you read this page?
https://cwiki.apache.org/confluence/display/solr/CharFilterFactories

Thanks;
Furkan KAMACI

