Posted to solr-user@lucene.apache.org by rachun <ra...@gmail.com> on 2014/05/26 11:26:36 UTC

about analyzer and tokenizer

Dear all,


How can I do the following?
I index a document containing "Macbook";
then, when I query for "mac book", I should get that document back.

This is my schema setting:

<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
      <analyzer> 
        <tokenizer class="solr.StandardTokenizerFactory"/>        
        <filter class="solr.ThaiWordFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_th.txt"/>
      </analyzer>
</fieldType>

Any suggestions would be much appreciated.
Chun.




--
View this message in context: http://lucene.472066.n3.nabble.com/about-analyzer-and-tokenizer-tp4138129.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: about analyzer and tokenizer

Posted by Dmitry Kan <so...@gmail.com>.

Hi Chun,

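You can use an n-gram filter on your tokens. Note that the edge n-gram filter
[1] only emits sequences anchored at the start of a token, whereas the plain
n-gram filter described on the same page emits sequences from every position.
With a configurable length range, the n-gram filter turns macbook into: ma,
ac, cb, bo, oo, ok, mac, acb, cbo, boo, ook, macb, acbo, cboo, book, etc.
Then, when querying, both mac and book should match one of these sequences and
you should get the macbook hit back. This comes at the price of a larger
index, though.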
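
As a minimal sketch of that idea (an illustration, not a tested configuration): the field type below reuses the analysis chain from the original question and adds solr.NGramFilterFactory at index time only, so that the query terms mac and book are compared as-is against the indexed grams. The name text_th_ngram and the gram sizes 3 to 7 are assumptions chosen for this example.

```xml
<!-- Hypothetical field type; the name and the gram sizes are assumptions. -->
<fieldType name="text_th_ngram" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: original chain plus n-grams, so "macbook" also yields "mac" and "book". -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ThaiWordFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_th.txt"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="7"/>
  </analyzer>
  <!-- Query time: no n-grams, so each query term must equal one indexed gram. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ThaiWordFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_th.txt"/>
  </analyzer>
</fieldType>
```

With these sizes, a query term shorter than 3 or longer than 7 characters has no matching gram, so the range would need to be tuned to the expected queries.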

[1]
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-EdgeN-GramFilter




-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan

Re: about analyzer and tokenizer

Posted by rachun <ra...@gmail.com>.

Thank you both very much for your suggestions.
I will experiment further to figure out which approach fits my case.

Chun.




Re: about analyzer and tokenizer

Posted by Jack Krupansky <ja...@basetechnology.com>.

Unfortunately, Solr and Lucene do not provide a truly clean out-of-the-box
solution for this obvious use case, but you can approximate it by using
index-time synonyms, so that "mac book" will also be indexed as "macbook" and
"macbook" will also be indexed as "mac book". Your synonyms.txt file would
contain:

macbook,mac book

Only use the synonym filter at index time. The standard query parsers don't
support multi-word (phrase) synonyms at query time.
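
A rough sketch of how this could be wired into the schema from the original question, assuming the synonym line above is saved in a file named synonyms.txt in the core's conf directory. The field type name text_th_syn is made up for this example, and the configuration is untested:

```xml
<!-- Hypothetical field type; the name and the synonyms file location are assumptions. -->
<fieldType name="text_th_syn" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: expand synonyms so "macbook" is also indexed as "mac book" and vice versa. -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ThaiWordFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_th.txt"/>
  </analyzer>
  <!-- Query time: same chain without the synonym filter. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ThaiWordFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_th.txt"/>
  </analyzer>
</fieldType>
```

Since the expansion happens at index time, existing documents would have to be reindexed after any change to synonyms.txt.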

-- Jack Krupansky
