Posted to solr-user@lucene.apache.org by Thomas Michael Engelke <th...@gmail.com> on 2014/02/03 21:33:28 UTC

Re: Not finding part of fulltext field when word ends in dot

That was a complicated answer, but ultimately the right one. Thank you very
much.


2014-01-30 Jack Krupansky <ja...@basetechnology.com>:

> The word delimiter filter will turn 26KA into two tokens, as if you had
> written "26 KA" without the quotes. The autoGeneratePhraseQueries option
> will cause the multiple terms to be treated as if they actually were
> enclosed within quotes, otherwise they will be treated as separate and
> unquoted terms. If you do enclose "26KA" in quotes in your query then
> autoGeneratePhraseQueries is not relevant.
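
For reference, a minimal sketch of where that option lives, assuming the "text" field type quoted further down in this thread. The analyzer chains are elided here and would stay as they are:

```xml
<!-- Sketch only: autoGeneratePhraseQueries is an attribute of the fieldType itself. -->
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <!-- index and query analyzers unchanged -->
</fieldType>
```

With the attribute set to "true", an unquoted term that the analyzer splits into several tokens is searched as a phrase; with "false", the tokens become separate terms.
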
>
> Ah... maybe the problem is that you have preserveOriginal="true" in your
> query analyzer. Do you have your default query operator set to "AND"? If
> so, it would treat "26KA" as "26" AND "KA" AND "26KA", which requires
> "26KA" (without the trailing dot) to be in the index.
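
The default operator in question is normally set in schema.xml. A sketch of the element to look for, assuming a stock Solr 3.x schema (the value shown is the one that would trigger the behavior described above, not necessarily what is configured in this installation):

```xml
<!-- schema.xml (Solr 3.x): operator applied between terms when the query gives none -->
<solrQueryParser defaultOperator="AND"/>
```

The operator can also be overridden per request with the q.op parameter.
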
>
> It seems counter-intuitive, but the attributes of the index and query word
> delimiter filters need to be slightly asymmetric.
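
One possible reading of that advice, sketched as a query analyzer that mirrors the one from the field type quoted below except for preserveOriginal. Turning it off on the query side only is an assumption based on the paragraph above, not a tested configuration:

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Assumption: only preserveOriginal differs from the index-side filter's counterpart here -->
  <filter class="solr.WordDelimiterFilterFactory"
          catenateWords="0"
          catenateNumbers="0"
          generateWordParts="1"
          splitOnCaseChange="1"
          generateNumberParts="1"
          catenateAll="0"
          preserveOriginal="0"
          splitOnNumerics="0"/>
  <!-- remaining query-side filters unchanged -->
</analyzer>
```

The index-side filter would keep preserveOriginal="1". Whether this resolves the mismatch is best checked on the analysis page of the admin interface before relying on it.
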
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Thomas Michael Engelke
> Sent: Thursday, January 30, 2014 2:16 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: Not finding part of fulltext field when word ends in dot
>
> I'm not sure I got my problem across. If I understand the snippet of
> documentation right, autoGeneratePhraseQueries only affects queries that
> result in multiple tokens, which mine does not. The version also is
> 3.6.0.1, and we're not planning on upgrading to any 4.x version.
>
>
> 2014-01-29 Jack Krupansky <ja...@basetechnology.com>
>
>> You might want to add autoGeneratePhraseQueries="true" to your field
>> type, but I don't think that would cause a break when going from 3.6 to
>> 4.x. The default for that attribute changed in Solr 3.5. What release was
>> your data indexed using? There may have been some subtle word delimiter
>> filter changes between 3.x and 4.x.
>>
>> Read:
>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201202.mbox/%3CC0551C512C863540BC59694A118452AA0764A434@ITS-EMBX-03.adsroot.itcs.umich.edu%3E
>>
>>
>>
>> -----Original Message----- From: Thomas Michael Engelke
>> Sent: Wednesday, January 29, 2014 11:16 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Not finding part of fulltext field when word ends in dot
>>
>>
>> The fieldType definition is a tad on the longer side:
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             catenateWords="1"
>>             catenateNumbers="1"
>>             generateNumberParts="1"
>>             splitOnCaseChange="1"
>>             generateWordParts="1"
>>             catenateAll="0"
>>             preserveOriginal="1"
>>             splitOnNumerics="0"
>>     />
>>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="german/synonyms.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>>             dictionary="german/german-common-nouns.txt"
>>             minWordSize="5"
>>             minSubwordSize="4"
>>             maxSubwordSize="15"
>>             onlyLongestMatch="true"
>>     />
>>
>>     <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true" enablePositionIncrements="true"/>
>>     <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="german/protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             catenateWords="0"
>>             catenateNumbers="0"
>>             generateWordParts="1"
>>             splitOnCaseChange="1"
>>             generateNumberParts="1"
>>             catenateAll="0"
>>             preserveOriginal="1"
>>             splitOnNumerics="0"
>>     />
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true" enablePositionIncrements="true"/>
>>     <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="german/protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>>
>> Thank you for taking a look.
>>
>>
>> 2014-01-29 Jack Krupansky <ja...@basetechnology.com>
>>
>>> What field type and analyzer/tokenizer are you using?
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Thomas Michael Engelke
>>> Sent: Wednesday, January 29, 2014 10:45 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Not finding part of fulltext field when word ends in dot
>>>
>>> Hello everybody,
>>>
>>> We have a legacy Solr installation, version 3.6.0.1. One of the indices
>>> defines a field named "content" as a fulltext field that holds a product
>>> description. One of the indexed records contains the following data
>>> (excerpt):
>>>
>>> z. B. in der Serie 26KA.
>>>
>>> I had the problem that searching for the value "26KA" didn't find anything.
>>> Using the analyzer of the administrative interface, with the full text on
>>> one hand and "26KA" as the query string on the other, I can see how the
>>> search string is transformed by the filter factories in use. The
>>> WordDelimiterFilterFactory transforms the "26KA." into "26KA", which is
>>> displayed like this (excerpt):
>>>
>>> 73 74  75    76
>>> in der Serie 26KA.
>>>             26KA
>>>
>>> It seems that it stripped the dot from "26KA.". Using the option to
>>> highlight matches, an analysis search of "26KA" shows that the lower of the
>>> two entries matches (after reaching the LowerCaseFilterFactory). However,
>>> querying the index using the query interface doesn't show any matches.
>>>
>>> I discovered that adding an asterisk to the search seems to work, as does
>>> adding the dot. I am puzzled by this, as I thought that the second added
>>> entry was the word actually indexed. I've tried looking up the description
>>> of the administrative interface, but the documentation only covers the
>>> latest version, where the display is different and (at least in the
>>> sample) doesn't show such "duplication".
>>>
>>> Can anybody shed some light onto this?
>>>
>>>
>>>
>>
>