
Posted to solr-user@lucene.apache.org by kumar <pa...@gmail.com> on 2014/02/09 18:17:01 UTC

Exact matches

Hi,

Whenever a user types a search query like

"sony xperia c"

it should match results like

sony xperia c price
sony xperia c reviews
sony xperia c photos

but my search query returns

Sony xperia act mobiles
sony xperia ace mobiles
sony xperia abc mobiles

Can anybody help me with how to do this?

My schema looks like this:

<field name="my_title" type="text_full" indexed="true" stored="false"
       multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />



<fieldType name="text_full" class="solr.TextField">
    <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory"
                    mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([\.,;:-_])" replacement=" " replace="all"/>
        <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30"
                minGramSize="1"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^\w\d\*繥ǘŠ])" replacement="" replace="all"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory"
                    mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([\.,;:-_])" replacement=" " replace="all"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^\w\d\*繥ǘŠ])" replacement="" replace="all"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
        <filter class="solr.SynonymFilterFactory" ignoreCase="true"
                synonyms="synonyms_fsw.txt" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

--
View this message in context: http://lucene.472066.n3.nabble.com/Exact-matches-tp4116340.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Exact matches

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

And once you get past the basics you may want to keep your eye on
http://quepid.io/

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Mon, Feb 10, 2014 at 1:27 AM, Erick Erickson <er...@gmail.com> wrote:
> Whoa! My first bit of advice is to spend some time getting familiar
> with the admin>>analysis page, because I suspect you're not
> doing what you expect.

Re: Exact matches

Posted by Erick Erickson <er...@gmail.com>.

Whoa! My first bit of advice is to spend some time getting familiar
with the admin>>analysis page, because I suspect you're not
doing what you expect.

1> KeywordTokenizer does NOT break up the input stream, so an
input of "sony xperia c price" gets tokenized as "sony xperia c price",
NOT the words "sony" "xperia" "c" and "price".

2> You use PatternReplace to remove the punctuation etc.

3> You use EdgeNGrams to create tokens like
s, so, son, sony. But then you do NOT use EdgeNGrams in your
query section. So your queries are probably not very robust. The
NGrams are why your matching is odd.

At the end of all this, you have a single string that gets n-grammed,
then an additional PatternReplace is done. I don't think, for instance,
that you will be able to search for "xperia" and get a hit. I rather
doubt that's what you want, but you know better than me.
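
To make the point above concrete, here is a hand-worked sketch of what
the index-time chain would be expected to emit for the input
"sony xperia c price". It is a trimmed copy of the posted index analyzer
(the charFilter, the first PatternReplace and the StopFilter are left out
for brevity), and the token lists in the comments are worked out by hand
rather than copied from the analysis page, so they are worth checking there.

```xml
<analyzer type="index">
    <!-- After KeywordTokenizer + LowerCaseFilter there is ONE token:
         "sony xperia c price" -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- EdgeNGram emits prefixes of that whole string:
         "s", "so", "son", "sony", "sony ", "sony x", ...,
         "sony xperia c price" -->
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30"
            minGramSize="1"/>
    <!-- Spaces are not matched by \w, so this step strips them:
         "sony", "sonyx", ..., "sonyxperiac", ..., "sonyxperiacprice" -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*繥ǘŠ])" replacement="" replace="all"/>
</analyzer>
```

Every indexed term starts at the beginning of the title, so no term
begins with "xperia", which is why a query for "xperia" alone would
have nothing to match.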

So it looks to me like you started out using KeywordTokenizer and then
added a bunch of filters to try to make your results what you expect. It's
possible that the decision to use KeywordTokenizer led you down an
overly-complex path.

I'd start with one of the other tokenizers that breaks things up on
input, e.g. StandardTokenizer, WhitespaceTokenizer, etc., and build up
the analysis chain (i.e. the filters) again, although I notice you have some
CJK characters in your PatternReplace, so whitespace may not be
suitable. If you are analyzing CJK text, there are tokenizers built for that.
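
As a starting point for that rebuild, a minimal sketch might look like
the following. The type name text_prefix is hypothetical, the choice of
StandardTokenizer and the filter list are assumptions rather than a
tested configuration, and omitTermFreqAndPositions is dropped from the
field on the assumption that phrase queries (which need positions) may
be wanted later.

```xml
<!-- "text_prefix" is a hypothetical name, not part of the original schema -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <!-- Splits "sony xperia c price" into sony / xperia / c / price -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Prefixes per word: s, so, son, sony, x, xp, ..., c, p, pr, ... -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
                maxGramSize="30"/>
    </analyzer>
    <analyzer type="query">
        <!-- No n-grams at query time: each typed word is matched as a prefix -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<field name="my_title" type="text_prefix" indexed="true" stored="false"
       multiValued="true" omitNorms="true"/>
```

With all query terms required (for example q.op=AND), "sony xperia c"
should then match "sony xperia c price" but not "sony xperia act mobiles",
because "c" is not a prefix of any word in the latter. It would still
match something like "sony xperia camera", since "c" is a prefix of
"camera"; whether that is acceptable depends on the use case. Synonyms,
stop words and accent mapping can be added back one filter at a time
while watching the analysis page.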

All that said, you know your problem space waaaay better than me, so this
may all be complete nonsense.....

Best,
Erick
