Posted to solr-user@lucene.apache.org by Alex Willmer <al...@logica.com> on 2012/04/19 18:04:21 UTC
StandardTokenizer and domain names containing digits
TLDR; How should I make Solr treat "ns1.define.logica.com" as a single token in
the same way "ns.define.logica.com" would be?
We are just starting to use Solr 3.5.0 in production and have run into
slightly surprising behaviour with the query "ns1.define.logica.com", sent
through an edismax handler with "q.op"=AND, defined as follows:
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <!-- #define customisations -->
    <str name="defType">edismax</str>
    <str name="q.op">AND</str>
    <str name="qf">
      body^0.5 comments^0.4 tags^1.2 title^2.0 involved^1.5 id^10.0
      author^10.9 changed created oneline^0.7
    </str>
    <str name="pf">
      body^0.2 tags^1.1 title^1.5
    </str>
  </lst>
</requestHandler>
The schema is defined with fields of type text_general, as found in the example
schema.xml, namely:
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
The search string is being tokenised to "ns1", "define.logica.com", and the
resulting query becomes
+DisjunctionMaxQuery((((tags:ns1 tags:define.logica.com)^1.2) |
id:ns1.define.logica.com^10.0 | ((body:ns1 body:define.logica.com)^0.5) |
((author:ns1 author:define.logica.com)^10.9) | ((oneline:ns1
oneline:define.logica.com)^0.7) | ((title:ns1 title:define.logica.com)^2.0) |
((involved:ns1 involved:define.logica.com)^1.5) | ((comments:ns1
comments:define.logica.com)^0.4))) DisjunctionMaxQuery((tags:"ns1
define.logica.com"^1.1 | body:"ns1 define.logica.com"^0.2 | title:"ns1
define.logica.com"^1.5))
meaning that documents containing "ns1" OR "define.logica.com" are returned.
This is contrary to e.g. "ns.define.logica.com", which is treated as a single
token. Is there a way I can make Solr treat both queries the same way?
Many thanks, Alex
--
Alex Willmer | Developer
2 Trinity Park, Birmingham, B37 7ES | United Kingdom
M: +44 7557 752744
al.willmer@logica.com | www.logica.com
Logica UK Ltd, registered in UK (registered number 947968)
Registered Office: 250 Brook Drive, Green Park, Reading RG2 6UA, United Kingdom
RE: StandardTokenizer and domain names containing digits
Posted by Steven A Rowe <sa...@syr.edu>.
Hi Alex,
Thanks for reporting back with concrete details of what worked for you - very helpful for others with similar projects.
Steve
-----Original Message-----
From: Alex Willmer [mailto:al.willmer@logica.com]
Sent: Monday, April 23, 2012 5:35 AM
To: solr-user@lucene.apache.org
Subject: Re: StandardTokenizer and domain names containing digits
Steven A Rowe <sarowe <at> syr.edu> writes:
> StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules
> from Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29:
> <http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>.
> These rules don't include recognition of URLs or domain names.
>
> Lucene/Solr includes another tokenizer that does recognize URLs and domain
> names, in addition to the UAX#29 Word Boundary rules: UAX29URLEmailTokenizer
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory>.
> (Stand-alone domain names are recognized as URLs.)
>
> My suggestion is that you add a filter (for both the indexing and querying)
> that splits tokens containing periods:
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>,
> something like (untested!):
>
> <filter class="solr.WordDelimiterFilterFactory"
>         splitOnCaseChange="0"
>         splitOnNumerics="0"
>         stemEnglishPossessive="0"
>         generateWordParts="1"
>         preserveOriginal="1" />
Steve, Thank you very much for this reply, it helped immensely. In the end I've gone for your suggestion, plus a swap of StandardTokenizer -> UAX29URLEmailTokenizer and setting autoGeneratePhraseQueries="true". The fieldType now looks like
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true"
            expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
autoGeneratePhraseQueries is set so that the tokens generated in the query analyzer behave more like tokens from a space delimited query. So "ns1.define.logica.com" finds a similar set of documents to "ns1 define logica com" (i.e. "ns1 AND define AND logica AND com"), rather than "ns1 OR define OR logica OR com".
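The difference between the two behaviours can be pictured with a toy model. The Python sketch below is not Solr code and does not reproduce how edismax builds queries; the helper names and the sample documents are made up for illustration. It only contrasts "any sub-token matches" with "the sub-tokens must appear together, in order", which is roughly what a generated phrase query demands.

```python
def matches_any(doc_tokens, query_tokens):
    """True if at least one query token occurs in the document (OR-like)."""
    return any(t in doc_tokens for t in query_tokens)


def matches_phrase(doc_tokens, query_tokens):
    """True if the query tokens occur consecutively and in order (phrase-like)."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


# Sub-tokens a delimiter-splitting filter might produce for the query
query = ["ns1", "define", "logica", "com"]

# Hypothetical, already-analysed documents
doc_a = ["server", "ns1", "define", "logica", "com", "is", "down"]
doc_b = ["please", "define", "the", "scope"]

print(matches_any(doc_a, query), matches_any(doc_b, query))
print(matches_phrase(doc_a, query), matches_phrase(doc_b, query))
```

In this model doc_b matches only under the OR-like interpretation, because it contains "define" but not the full sequence.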
Many thanks, Alex
RE: StandardTokenizer and domain names containing digits
Posted by Steven A Rowe <sa...@syr.edu>.
Hi Alex,
TLDR; Try adding WordDelimiterFilter to your analyzer(s).
StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules from Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29: <http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>. These rules don't include recognition of URLs or domain names.
(The details: in UAX#29 Word Boundary rules terminology, the default rule - WB14 - says that boundaries will be made everywhere they are not prohibited. There is no rule prohibiting a boundary within the character sequence /Numeric, MidNumLet, ALetter/ - "." FULL STOP belongs to MidNumLet - so boundaries are made between Numeric and MidNumLet, and between MidNumLet and ALetter. By contrast, rules WB6 and WB7 do prohibit boundaries within /ALetter, MidNumLet, ALetter/, which is why "ns.define.logica.com" stays whole. StandardTokenizer emits as tokens the character sequences between UAX#29 word boundaries that contain alphanumeric characters, so the MidNumLet-only token is dropped.)
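To make the rule interplay concrete, here is a minimal Python sketch of just the rules involved (WB5 to WB12, plus the default WB14). It is not the Lucene implementation: character classes are approximated with ASCII tests, only "." is treated as MidNumLet, and the function names are invented for this example.

```python
def char_class(ch):
    """Very rough stand-in for the UAX#29 Word_Break property (ASCII only)."""
    if ch.isascii() and ch.isalpha():
        return "ALetter"
    if ch.isascii() and ch.isdigit():
        return "Numeric"
    if ch == ".":
        return "MidNumLet"
    return "Other"


def no_break(classes, i):
    """True if a boundary between positions i-1 and i is prohibited."""
    prev, cur = classes[i - 1], classes[i]
    nxt = classes[i + 1] if i + 1 < len(classes) else None
    prev2 = classes[i - 2] if i >= 2 else None
    alnum = {"ALetter", "Numeric"}
    # WB5, WB8, WB9, WB10: no break inside runs of letters and digits
    if prev in alnum and cur in alnum:
        return True
    # WB6 / WB7: ALetter MidNumLet ALetter stays together
    if prev == "ALetter" and cur == "MidNumLet" and nxt == "ALetter":
        return True
    if prev2 == "ALetter" and prev == "MidNumLet" and cur == "ALetter":
        return True
    # WB11 / WB12: Numeric MidNumLet Numeric stays together
    if prev == "Numeric" and cur == "MidNumLet" and nxt == "Numeric":
        return True
    if prev2 == "Numeric" and prev == "MidNumLet" and cur == "Numeric":
        return True
    # WB14: break everywhere else
    return False


def tokenize(text):
    """Split at permitted boundaries, keep segments containing alphanumerics."""
    classes = [char_class(c) for c in text]
    segments, start = [], 0
    for i in range(1, len(text)):
        if not no_break(classes, i):
            segments.append(text[start:i])
            start = i
    segments.append(text[start:])
    return [s for s in segments if any(c.isalnum() for c in s)]


print(tokenize("ns1.define.logica.com"))  # ['ns1', 'define.logica.com']
print(tokenize("ns.define.logica.com"))   # ['ns.define.logica.com']
```

The only difference between the two inputs is the digit before the first period: "1." followed by a letter matches neither the letter rules nor the numeric rules, so a boundary is allowed on both sides of that period.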
Lucene/Solr includes another tokenizer that does recognize URLs and domain names, in addition to the UAX#29 Word Boundary rules: UAX29URLEmailTokenizer <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory>. (Stand-alone domain names are recognized as URLs.)
I think Lucene/Solr should have a way to tokenize URL (and e-mail) components, so that e.g. if you have "http://www.example.com/page.html" in your text, your index can contain "www.example.com" and "example.com", to enable e.g. queries containing just "example.com". I'd like to have a URLFilter and an EmailFilter that would configurably tokenize components (e.g. for URLs: protocol; domain; base domain; domain elements; full path; path elements; URL-decoded-uax29-word-boundary-tokenized path elements).
This doesn't solve your problem, though.
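For illustration only, the kind of URL component breakdown described above could be sketched in Python with the standard library. No such URLFilter exists in the thread; the function name, the dictionary keys, and the naive "every trailing run of two or more labels" notion of a domain are all assumptions made for this example.

```python
from urllib.parse import urlsplit


def url_components(url):
    """Hypothetical breakdown of a URL into indexable components."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    # Trailing runs of at least two labels: "www.example.com", "example.com"
    domains = [".".join(labels[i:]) for i in range(len(labels) - 1)]
    path_elements = [p for p in parts.path.split("/") if p]
    return {
        "protocol": parts.scheme,
        "domains": domains,
        "domain_elements": labels,
        "full_path": parts.path,
        "path_elements": path_elements,
    }


components = url_components("http://www.example.com/page.html")
print(components["domains"])        # ['www.example.com', 'example.com']
print(components["path_elements"])  # ['page.html']
```

A real filter would need a proper public-suffix list to find the base domain; taking trailing labels is only a placeholder.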
My suggestion is that you add a filter (for both the indexing and querying) that splits tokens containing periods: <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>, something like (untested!):
<filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="0"
        splitOnNumerics="0"
        stemEnglishPossessive="0"
        generateWordParts="1"
        preserveOriginal="1" />
Note that this filter will be applied to *all* of your tokens, not just domain names.
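A rough way to picture the effect is the Python sketch below. It is only an approximation of the delimiter-splitting part of the filter with the original token preserved; it is not the real WordDelimiterFilter logic, ignores case-change and possessive handling, and the helper name is invented.

```python
import re


def split_on_delimiters(token, preserve_original=True):
    """Hypothetical helper: split a token on non-alphanumeric characters."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    if len(parts) == 1 and parts[0] == token:
        return [token]  # nothing to split
    return ([token] if preserve_original else []) + parts


# Tokens as a tokenizer might hand them to the filter
for tok in ["ns1", "define.logica.com", "3.5.0"]:
    print(tok, "->", split_on_delimiters(tok))
```

The last input shows the caveat: a version number such as "3.5.0" is split into its parts just like a domain name is.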
Steve