Posted to solr-user@lucene.apache.org by Scottie <sc...@live.com> on 2010/08/20 17:19:47 UTC

Tokenising on Each Letter

Just getting ready to launch Solr on one of our websites.

Unfortunately, we can't work out one little issue; how do I configure Solr
such that it can search our model numbers easily? For example:

ADS12P2

If somebody searched for ADS it would match, because currently the value is
split into tokens where letters and numbers meet; if somebody searched ADS12
it would also work, etc.

But if somebody searches ADS1, currently there are no results?

Does anybody know how I should configure Solr so that it will tokenise a
certain field on each letter, or support wildcards, etc.?

Kind Regards

Scott
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tokenising-on-Each-Letter-tp1247113p1247113.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tokenising on Each Letter

Posted by Erick Erickson <er...@gmail.com>.
Another thing you might try is setting preserveOriginal="1"
(just saw this in another thread). Which approach is "better"
usually depends on your problem space...

Best
Erick

On Mon, Aug 23, 2010 at 9:16 AM, Scottie <sc...@live.com> wrote:

>
> Nikolas, thanks a lot for that, I've just gave it a quick test and it
> definitely seems to work for the examples I've gave.
>
> Thanks again,
>
> Scott
>
>
> From: Nikolas Tautenhahn [via Lucene]
> Sent: Monday, August 23, 2010 3:14 PM
> To: Scottie
> Subject: Re: Tokenising on Each Letter
>
>
> Hi Scottie,
>
> > Could you elaborate about N gram for me, based on my schema?
>
> just a quick reply:
>
>
> >     <fieldType name="textNGram" class="solr.TextField"
> positionIncrementGap="100">
> >       <analyzer type="index">
> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >         <!-- in this example, we will only use synonyms at query time
> >         <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> -->
> >
> >         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="0" catenateWords="1"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> splitOnNumerics="0" preserveOriginal="1"/>
> >         <filter class="solr.LowerCaseFilterFactory"/>
> > <filter class="solr.EdgeNGramFilterFactory" side="front" minGramSize="2"
> maxGramSize="30" />
> >         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >       </analyzer>
> >       <analyzer type="query">
> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> >         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="0" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> splitOnNumerics="0" preserveOriginal="1"/>
> > <filter class="solr.LowerCaseFilterFactory"/>
> >         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >       </analyzer>
> >     </fieldType>
>
> Will produce any NGrams from 2 up to 30 Characters, for Info check
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
>
> Be sure to adjust those sizes (minGramSize/maxGramSize) so that
> maxGramSize is big enough to keep the whole original serial number/model
> number and minGramSize is not so small that you fill your index with
> useless information.
>
> Best regards,
> Nikolas Tautenhahn
>
>
>
>
>
>
> --------------------------------------------------------------------------------
>
> View message @
> http://lucene.472066.n3.nabble.com/Tokenising-on-Each-Letter-tp1247113p1292238.html
> To unsubscribe from Tokenising on Each Letter, click here.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tokenising-on-Each-Letter-tp1247113p1294586.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Tokenising on Each Letter

Posted by Scottie <sc...@live.com>.
Nikolas, thanks a lot for that, I've just given it a quick test and it definitely seems to work for the examples I've given.

Thanks again,

Scott


-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tokenising-on-Each-Letter-tp1247113p1294586.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tokenising on Each Letter

Posted by Nikolas Tautenhahn <ni...@livinglogic.de>.
Hi Scottie,

> Could you elaborate about N gram for me, based on my schema?

just a quick reply:

>     <fieldType name="textNGram" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <!-- in this example, we will only use synonyms at query time
>         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> -->
> 
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
> 		<filter class="solr.EdgeNGramFilterFactory" side="front" minGramSize="2" maxGramSize="30" />
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
> 		<filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>

This will produce edge n-grams from 2 up to 30 characters; for more info, check
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory

Be sure to adjust those sizes (minGramSize/maxGramSize) so that
maxGramSize is big enough to keep the whole original serial number/model
number and minGramSize is not so small that you fill your index with
useless information.
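
To make concrete what this chain does to a model number, here is a small
Python sketch of the edge n-gram idea (an illustration of the general
technique, not Solr's actual implementation):

```python
def edge_ngrams(token, min_gram=2, max_gram=30):
    # Front-anchored grams, as EdgeNGramFilterFactory with side="front":
    # every prefix of the token from min_gram up to max_gram characters.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# After WhitespaceTokenizer + LowerCaseFilter, "ADS12P2" becomes "ads12p2".
print(edge_ngrams("ads12p2"))
# ['ad', 'ads', 'ads1', 'ads12', 'ads12p', 'ads12p2']
```

The query "ads1" now matches the indexed gram "ads1" directly, with no
wildcard needed.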

Best regards,
Nikolas Tautenhahn



Re: Tokenising on Each Letter

Posted by Scottie <sc...@live.com>.
Probably a good idea to post the relevant information! I guess I thought it
would be a really obvious answer, but it seems it's a bit more complex ;)

<field name="productsModel" type="textTight" indexed="true" stored="true"
omitNorms="true"/>

    <!-- Less flexible matching, but less false matches.  Probably not ideal
for product names,
         but may be good for SKUs.  Can insert dashes in the wrong place and
still match. -->
    <fieldType name="textTight" class="solr.TextField"
positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="0" generateNumberParts="0" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
        <!-- this filter can remove any duplicate tokens that appear at the
same position - sometimes
             possible with WordDelimiterFilter in conjunction with stemming.
-->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

It seems you may be correct about the catenateAll option, but I'm not sure
adding a wildcard at the end of every search would be a great idea. This is
meant to be applied to a general search box while still retaining flexibility
for model numbers. Right now we are using MySQL %...% wildcards, so it matches
pretty much anything in the model number, whether you cut off the start or the
end, etc., and I wanted to retain that.
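
(For comparison: the MySQL %term% behaviour matches in the middle of a model
number, not only at the front, which is what a plain NGramFilterFactory rather
than the edge variant would give. A rough Python sketch of the difference,
illustrative only and not Solr's code:)

```python
def edge_ngrams(token, min_gram=2, max_gram=30):
    # Prefix grams only: matches queries that start like the model number.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def all_ngrams(token, min_gram=2, max_gram=30):
    # Grams at every offset: matches queries anywhere inside the model number,
    # like a MySQL LIKE '%term%' search, at the cost of a much bigger index.
    return [token[i:i + n]
            for n in range(min_gram, min(max_gram, len(token)) + 1)
            for i in range(len(token) - n + 1)]

print("s12" in edge_ngrams("ads12p2"))  # False: "s12" is not a prefix
print("s12" in all_ngrams("ads12p2"))   # True: infix grams are indexed
```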

Could you elaborate about N gram for me, based on my schema?

The main reason I picked textTight was for model numbers like
EQW-500DBE-1AVER; I thought it would produce better results.

Thanks a lot for the detailed reply.

Scott
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tokenising-on-Each-Letter-tp1247113p1291984.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tokenising on Each Letter

Posted by Erick Erickson <er...@gmail.com>.
I suspect (though I can't say for sure since you didn't include your
schema definition, both the type and the actual field def) that your
problem stems from the WordDelimiterFilterFactory options. The
default in the schema usually has catenateAll="0", in which case
you have the tokens "ads" and "12" but not "ads12", so searching
for "ads1*" can't work. You could try varying your
WordDelimiterFilterFactory parameters (your specific example
works for me), but that may have other effects on your setup.

You could also use a different analysis chain for model number
that didn't even try to split it up. Or you could use one of the
n-gram type filters on your model numbers to give you lots of
flexibility....
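
(A quick sketch of the catenateAll point in Python — a rough approximation of
the splitting behaviour, not the actual filter:)

```python
import re

def word_delimiter(token, catenate_all=False):
    # Split where runs of letters and runs of digits meet, roughly what
    # WordDelimiterFilterFactory does with word/number part generation on.
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    if catenate_all:
        parts.append("".join(parts))  # also emit the whole run joined back up
    return [p.lower() for p in parts]

print(word_delimiter("ADS12P2"))                     # ['ads', '12', 'p', '2']
print(word_delimiter("ADS12P2", catenate_all=True))  # ['ads', '12', 'p', '2', 'ads12p2']
# A wildcard query ads1* matches no token in the first list, but with
# catenateAll="1" the term ads12p2 is indexed, so ads1* can match it.
```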

And if none of this is germane, can you explain more about what
you're trying to accomplish?

Best
Erick

On Fri, Aug 20, 2010 at 8:19 AM, Scottie <sc...@live.com> wrote:

>
> Just getting ready to launch Solr on one of our websites.
>
> Unfortunately, we can't work out one little issue; how do I configure Solr
> such that it can search our model numbers easily? For example:
>
> ADS12P2
>
> If somebody searched for ADS it would match, because currently the value
> is split into tokens where letters and numbers meet; if somebody searched
> ADS12 it would also work, etc.
>
> But if somebody searches ADS1, currently there are no results?
>
> Does anybody know how I should configure Solr so that it will tokenise a
> certain field on each letter, or support wildcards, etc.?
>
> Kind Regards
>
> Scott