Posted to solr-user@lucene.apache.org by rachun <ra...@gmail.com> on 2014/05/26 11:26:36 UTC
about analyzer and tokenizer
Dear all,
How can I do this...
I index the document => Macbook
then when I query "mac book" I should get the result.
This is my schema setting...
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ThaiWordFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_th.txt"/>
  </analyzer>
</fieldType>
Any suggestion would be very much appreciated.
Chun.
--
View this message in context: http://lucene.472066.n3.nabble.com/about-analyzer-and-tokenizer-tp4138129.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: about analyzer and tokenizer
Posted by Dmitry Kan <so...@gmail.com>.
Hi Chun,
You can use an n-gram filter [1] on your tokens; it will produce all possible
letter sequences within a certain (configurable) length range, e.g. for
"macbook": ma, ac, cb, bo, oo, ok, mac, acb, boo, ook, book and so on.
Then when querying, both "mac" and "book" will match grams of "macbook" and
you should get the macbook hit back. This comes at the price of a larger
index, though. (Note that the edge n-gram variant only emits token prefixes
such as ma, mac, macb; to let "book" match inside "macbook" you want the
plain NGramFilterFactory.)
[1]
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-EdgeN-GramFilter
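As a rough sketch, such a field type could look like the following. The field
type name and the minGramSize/maxGramSize values are illustrative choices, not
something from your schema; tune them to the token lengths you expect:

```xml
<!-- Sketch only: n-gram the tokens at index time; query side stays plain
     so "mac" and "book" are matched against the indexed grams. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits every substring of each token between 2 and 10 chars,
         so "macbook" also yields "mac" and "book" -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="10"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

You can check what grams are produced for a given input on the Analysis page
of the Solr admin UI before reindexing.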
--
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
Re: about analyzer and tokenizer
Posted by rachun <ra...@gmail.com>.
Thank you both very much for your suggestions.
I will experiment to figure out which approach best matches my case.
Chun.
--
View this message in context: http://lucene.472066.n3.nabble.com/about-analyzer-and-tokenizer-tp4138129p4138227.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: about analyzer and tokenizer
Posted by Jack Krupansky <ja...@basetechnology.com>.
Unfortunately, Solr and Lucene do not provide a truly clean out-of-the-box
solution for this obvious use case, but you can approximate it with
index-time synonyms, so that "mac book" also indexes as "macbook" and
"macbook" also indexes as "mac book". Your synonyms.txt file would contain:
macbook,mac book
Only use the synonym filter at index time; the standard query parsers don't
support multi-word (phrase) synonyms at query time.
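A minimal sketch of such a field type, assuming a synonyms.txt file in the
conf directory with the single line above (the field type name here is
illustrative):

```xml
<!-- Sketch only: expand synonyms at index time; the query analyzer
     deliberately has no synonym filter. -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- expand="true" indexes both "macbook" and "mac book" variants -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Unlike the n-gram approach this keeps the index small, but you have to list
each compound-word pair in synonyms.txt yourself.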
-- Jack Krupansky