Posted to solr-user@lucene.apache.org by Alex Willmer <al...@logica.com> on 2012/04/19 18:04:21 UTC

StandardTokenizer and domain names containing digits

TLDR; How should I make Solr treat "ns1.define.logica.com" as a single token in 
the same way "ns.define.logica.com" would be?

We are just starting to use Solr 3.5.0 in production and have run into some 
slightly surprising behaviour with the query "ns1.define.logica.com". The query 
goes through an edismax handler with "q.op"=AND, defined as

<requestHandler name="search" class="solr.SearchHandler" default="true">
 <lst name="defaults">
   <str name="echoParams">explicit</str>
   <int name="rows">10</int>
   <!-- #define customisations -->
   <str name="defType">edismax</str>
   <str name="q.op">AND</str>
   <str name="qf">
    body^0.5 comments^0.4 tags^1.2 title^2.0 involved^1.5 id^10.0
    author^10.9 changed created oneline^0.7
   </str>
   <str name="pf">
    body^0.2 tags^1.1 title^1.5
   </str>
 </lst>
</requestHandler>

The schema is defined with fields of type text_general, as found in the example 
schema.xml, namely:

<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The search string is being tokenised to "ns1", "define.logica.com", and the 
resulting query becomes

+DisjunctionMaxQuery((((tags:ns1 tags:define.logica.com)^1.2) | 
id:ns1.define.logica.com^10.0 | ((body:ns1 body:define.logica.com)^0.5) | 
((author:ns1 author:define.logica.com)^10.9) | ((oneline:ns1 
oneline:define.logica.com)^0.7) | ((title:ns1 title:define.logica.com)^2.0) | 
((involved:ns1 involved:define.logica.com)^1.5) | ((comments:ns1 
comments:define.logica.com)^0.4))) DisjunctionMaxQuery((tags:"ns1 
define.logica.com"^1.1 | body:"ns1 define.logica.com"^0.2 | title:"ns1 
define.logica.com"^1.5))

meaning that documents containing "ns1" OR "define.logica.com" are returned. 
This is contrary to e.g. "ns.define.logica.com", which is treated as a single 
token. Is there a way I can make Solr treat both queries the same way?

Many thanks, Alex
-- 
Alex Willmer | Developer
2 Trinity Park,  Birmingham, B37 7ES | United Kingdom 
M: +44 7557 752744
al.willmer@logica.com | www.logica.com
Logica UK Ltd, registered in UK (registered number 947968)
Registered Office: 250 Brook Drive, Green Park, Reading RG2 6UA, United Kingdom

RE: StandardTokenizer and domain names containing digits

Posted by Steven A Rowe <sa...@syr.edu>.

Hi Alex,

Thanks for reporting back with concrete details of what worked for you - very helpful for others with similar projects.

Steve

Re: StandardTokenizer and domain names containing digits

Posted by Alex Willmer <al...@logica.com>.

Steven A Rowe <sarowe <at> syr.edu> writes:
> StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules
> from Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29:
> <http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>.
> These rules don't include recognition of URLs or domain names.
>
> Lucene/Solr includes another tokenizer that does recognize URLs and domain
> names, in addition to the UAX#29 Word Boundary rules: UAX29URLEmailTokenizer
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory>.
> (Stand-alone domain names are recognized as URLs.)
>
> My suggestion is that you add a filter (for both the indexing and querying)
> that splits tokens containing periods:
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>,
> something like (untested!):
> 
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnCaseChange="0"
>             splitOnNumerics="0"
>             stemEnglishPossessive="0"
>             generateWordParts="1"
>             preserveOriginal="1" />

Steve, thank you very much for this reply; it helped immensely. In the end I've 
gone for your suggestion, plus a swap of StandardTokenizer -> 
UAX29URLEmailTokenizer and setting autoGeneratePhraseQueries="true". The 
fieldType now looks like

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" 
autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" 
            expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" 
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

autoGeneratePhraseQueries is set so that the tokens generated in the query 
analyzer behave more like tokens from a space delimited query. So 
"ns1.define.logica.com" finds a similar set of documents to "ns1 define logica 
com" (i.e. "ns1 AND define AND logica AND com"), rather than "ns1 OR define OR 
logica OR com". 
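
To make the difference between those query shapes concrete, here is a toy Python sketch. It is only an illustration under simplifying assumptions: documents are plain token lists, and the three helper functions (matches_any, matches_all, matches_phrase) are hypothetical names, not Solr or Lucene APIs. It ignores scoring, boosts and the extra same-position tokens that preserveOriginal produces.

```python
# Toy model of three ways a multi-token query can match a document.
# Documents are lists of tokens; nothing here is Solr/Lucene code.

def matches_any(doc, terms):
    """OR semantics: at least one term occurs in the document."""
    return any(t in doc for t in terms)

def matches_all(doc, terms):
    """AND semantics: every term occurs somewhere in the document."""
    return all(t in doc for t in terms)

def matches_phrase(doc, terms):
    """Phrase semantics: the terms occur adjacently and in order."""
    n = len(terms)
    return any(doc[i:i + n] == terms for i in range(len(doc) - n + 1))

# Hypothetical documents, already tokenised and lower-cased.
docs = {
    "a": ["ns1", "define", "logica", "com"],
    "b": ["com", "logica", "define", "ns1"],
    "c": ["logica", "press", "release"],
}
terms = ["ns1", "define", "logica", "com"]

any_hits = sorted(k for k, d in docs.items() if matches_any(d, terms))
all_hits = sorted(k for k, d in docs.items() if matches_all(d, terms))
phrase_hits = sorted(k for k, d in docs.items() if matches_phrase(d, terms))

print(any_hits)     # ['a', 'b', 'c']
print(all_hits)     # ['a', 'b']
print(phrase_hits)  # ['a']
```

In this toy model the phrase match is stricter than AND (it also requires order and adjacency), so the phrase results are a subset of the AND results, and both are far narrower than the OR results.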

Many thanks, Alex

RE: StandardTokenizer and domain names containing digits

Posted by Steven A Rowe <sa...@syr.edu>.

Hi Alex,

TLDR; Try adding WordDelimiterFilter to your analyzer(s).

StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules from Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29: <http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>.  These rules don't include recognition of URLs or domain names.

(The details: in UAX#29 Word Boundary rules terminology, the default rule - WB14 - says that boundaries will be made everywhere they are not prohibited.  There is no rule prohibiting a boundary in the character sequence /Numeric, MidNumLet, ALetter/ ("." FULL STOP belongs to MidNumLet), so boundaries are made between Numeric and MidNumLet, and between MidNumLet and ALetter.  By contrast, rules WB6 and WB7 do prohibit a boundary in /ALetter, MidNumLet, ALetter/, which is why "ns.define.logica.com" stays whole.  StandardTokenizer emits as tokens the character sequences between UAX#29 word boundaries that contain alphanumeric characters, so the MidNumLet-only token is dropped.)
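
The boundary rules described above can be sketched in a few lines of Python. This is a deliberately simplified model, not Lucene's implementation: it knows only three character classes (ALetter, Numeric, and MidNumLet for "." alone), covers only rules WB5 to WB12 for those classes, and the function names char_class, is_boundary and tokenize are made up for illustration.

```python
# Simplified model of the UAX#29 word-boundary rules for three classes.
# NOT Lucene's StandardTokenizer; just enough to show the ns1 vs ns case.

def char_class(ch):
    if ch.isdigit():
        return "Numeric"
    if ch.isalpha():
        return "ALetter"
    if ch == ".":          # FULL STOP is the only MidNumLet modelled here
        return "MidNumLet"
    return "Other"

def is_boundary(classes, i):
    """Is there a word boundary between position i-1 and position i?"""
    prev, nxt = classes[i - 1], classes[i]
    word = {"ALetter", "Numeric"}
    # WB5, WB8, WB9, WB10: no break between letters and/or digits.
    if prev in word and nxt in word:
        return False
    # WB6 / WB12: no break before MidNumLet when it sits between two
    # characters of the same class (letter.letter or digit.digit).
    if (nxt == "MidNumLet" and prev in word
            and i + 1 < len(classes) and classes[i + 1] == prev):
        return False
    # WB7 / WB11: no break after MidNumLet in the same situation.
    if (prev == "MidNumLet" and nxt in word
            and i >= 2 and classes[i - 2] == nxt):
        return False
    # WB14: otherwise, break everywhere.
    return True

def tokenize(text):
    classes = [char_class(c) for c in text]
    tokens, start = [], 0
    for i in range(1, len(text) + 1):
        if i == len(text) or is_boundary(classes, i):
            segment = text[start:i]
            # Keep only segments that contain an alphanumeric character.
            if any(c.isalnum() for c in segment):
                tokens.append(segment)
            start = i
    return tokens

print(tokenize("ns.define.logica.com"))   # ['ns.define.logica.com']
print(tokenize("ns1.define.logica.com"))  # ['ns1', 'define.logica.com']
```

The digit at the end of "ns1" is what matters: "1.d" is Numeric, MidNumLet, ALetter, which no rule protects, whereas "s.d" is ALetter, MidNumLet, ALetter and is kept together.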

Lucene/Solr includes another tokenizer that does recognize URLs and domain names, in addition to the UAX#29 Word Boundary rules: UAX29URLEmailTokenizer <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory>.  (Stand-alone domain names are recognized as URLs.)

I think Lucene/Solr should have a way to tokenize URL (and e-mail) components, so that e.g. if you have "http://www.example.com/page.html" in your text, your index can contain "www.example.com" and "example.com", to enable e.g. queries containing just "example.com".  I'd like to have a URLFilter and an EmailFilter that would configurably tokenize components (e.g. for URLs: protocol; domain; base domain; domain elements; full path; path elements; URL-decoded-uax29-word-boundary-tokenized path elements).

This doesn't solve your problem, though.

My suggestion is that you add a filter (for both the indexing and querying) that splits tokens containing periods: <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>, something like (untested!):

    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />

Note that this filter will be applied to *all* of your tokens, not just domain names.
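
As a rough picture of what splitting on periods with preserveOriginal="1" gives you, here is a small Python model. It is an assumption-laden sketch rather than the filter itself: split_on_delimiters is a hypothetical helper that only splits on runs of non-alphanumeric ASCII characters, and it ignores case-change and letter/digit splitting, token positions, and the order in which the real filter emits tokens.

```python
import re

def split_on_delimiters(token, preserve_original=True):
    """Rough model: split a token on runs of non-alphanumeric characters,
    optionally keeping the unsplit token as well."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    if parts == [token]:
        # Nothing to split (e.g. "ns1"): the token passes through unchanged.
        return [token]
    return ([token] if preserve_original else []) + parts

print(split_on_delimiters("define.logica.com"))
# ['define.logica.com', 'define', 'logica', 'com']
print(split_on_delimiters("ns1"))
# ['ns1']
```

Because every token goes through the same splitting, anything else containing punctuation (not just host names) would also gain extra sub-tokens in the index.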
 
Steve
 