Posted to solr-user@lucene.apache.org by Thomas Michael Engelke <th...@gmail.com> on 2014/01/29 16:45:15 UTC

Not finding part of fulltext field when word ends in dot

Hello everybody,

We have a legacy Solr installation, version 3.6.0.1. One of the indices
defines a field named "content" as a fulltext field where a product
description will reside. One of the records indexed contains the following
data (excerpt):

z. B. in der Serie 26KA.

My problem is that searching for the value "26KA" doesn't find anything.
Using the analysis page of the administrative interface, with the full text
as the field value and "26KA" as the query string, I can see how the text
is transformed by the configured filter factories. The WordDelimiterFilterFactory
transforms "26KA." into "26KA", which is displayed like this (excerpt):

73 74  75    76
in der Serie 26KA.
             26KA

It seems that it stripped the "26KA." of the dot. Using the option to
highlight matches, an analysis search of "26KA" shows the lower of the two
entries matches (after reaching the LowerCaseFilterFactory). However,
querying the index using the query interface doesn't show any matches.

I discovered that adding an asterisk to the search seems to work, as does
adding the dot. I am puzzled by this, as I thought that the second added
entry was the word actually indexed. I've tried looking up the definition
of the administrative interface, but the documentation only specifies this
for the latest version, where the display is different and (at least in the
sample) doesn't show such "duplication".

Can anybody shed some light onto this?

Re: Not finding part of fulltext field when word ends in dot

Posted by Thomas Michael Engelke <th...@gmail.com>.
That was a complicated answer, but ultimately the right one. Thank you very
much.



Re: Not finding part of fulltext field when word ends in dot

Posted by Jack Krupansky <ja...@basetechnology.com>.
The word delimiter filter will turn 26KA into two tokens, as if you had 
written "26 KA" without the quotes. The autoGeneratePhraseQueries option 
will cause the multiple terms to be treated as if they actually were 
enclosed within quotes, otherwise they will be treated as separate and 
unquoted terms. If you do enclose "26KA" in quotes in your query then 
autoGeneratePhraseQueries is not relevant.

Ah... maybe the problem is that you have preserveOriginal="true" in your 
query analyzer. Do you have your default query operator set to "AND"? If so, 
it would treat "26KA" as "26" AND "KA" AND "26KA", which requires 
"26KA" (without the trailing dot) to be in the index.

It seems counter-intuitive, but the attributes of the index and query word 
delimiter filters need to be slightly asymmetric.
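
One way to read the asymmetry suggested above is sketched below. It is only an assumption about what the query-side analyzer could look like, not the configuration that was finally used (the thread does not record that). The index analyzer is assumed to keep preserveOriginal="1", while the query analyzer sets preserveOriginal="0" so that the query contributes only the delimiter-stripped token. All attribute names come from the fieldType definition posted in this thread. The last element shows where a Solr 3.x schema.xml can declare the default operator; the value "AND" is the case being asked about, not a confirmed setting.

```xml
<!-- Sketch only: query-side analyzer with preserveOriginal switched off.
     The index-side analyzer is assumed to stay as posted (preserveOriginal="1"). -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          catenateWords="0"
          catenateNumbers="0"
          generateWordParts="1"
          splitOnCaseChange="1"
          generateNumberParts="1"
          catenateAll="0"
          preserveOriginal="0"
          splitOnNumerics="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory"
          words="german/stopwords.txt" ignoreCase="true"
          enablePositionIncrements="true"/>
  <filter class="solr.SnowballPorterFilterFactory"
          language="German2" protected="german/protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

<!-- Default operator as declared in schema.xml; "AND" is an assumed value here. -->
<solrQueryParser defaultOperator="AND"/>
```

After a change like this, the analysis page of the administrative interface can be used again to compare the index-side and query-side tokens for "26KA." and "26KA". Documents may need to be reindexed if the index analyzer is changed as well.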

-- Jack Krupansky



Re: Not finding part of fulltext field when word ends in dot

Posted by Thomas Michael Engelke <th...@gmail.com>.
I'm not sure I got my problem across. If I understand the snippet of
documentation right, autoGeneratePhraseQueries only affects queries that
result in multiple tokens, which mine does not. The version also is
3.6.0.1, and we're not planning on upgrading to any 4.x version.



Re: Not finding part of fulltext field when word ends in dot

Posted by Jack Krupansky <ja...@basetechnology.com>.
You might want to add autoGeneratePhraseQueries="true" to your field type, 
but I don't think that would cause a break when going from 3.6 to 4.x. The 
default for that attribute changed in Solr 3.5. What release was your data 
indexed using? There may have been some subtle word delimiter filter changes 
between 3.x and 4.x.

Read:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201202.mbox/%3CC0551C512C863540BC59694A118452AA0764A434@ITS-EMBX-03.adsroot.itcs.umich.edu%3E
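
For reference, autoGeneratePhraseQueries is an attribute of the fieldType element itself, not of a filter. A minimal sketch, assuming the "text" type from this thread (the analyzer body below is a shortened placeholder, not the full filter chain):

```xml
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <!-- Shortened for illustration; the real type keeps its index and query analyzers. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this set, a single query word that the analyzer splits into several tokens is searched as a phrase rather than as separate, unquoted terms.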




Re: Not finding part of fulltext field when word ends in dot

Posted by Thomas Michael Engelke <th...@gmail.com>.
The fieldType definition is a tad on the longer side:

                <fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
                        <analyzer type="index">
                                <tokenizer
class="solr.WhitespaceTokenizerFactory"/>

                                <filter
class="solr.WordDelimiterFilterFactory"
                                        catenateWords="1"
                                        catenateNumbers="1"
                                        generateNumberParts="1"
                                        splitOnCaseChange="1"
                                        generateWordParts="1"
                                        catenateAll="0"
                                        preserveOriginal="1"
                                        splitOnNumerics="0"
                                />

                                <filter
class="solr.LowerCaseFilterFactory"/>
                                <filter class="solr.SynonymFilterFactory"
synonyms="german/synonyms.txt" ignoreCase="true" expand="true"/>
                                <filter
class="solr.DictionaryCompoundWordTokenFilterFactory"

dictionary="german/german-common-nouns.txt"
                                        minWordSize="5"
                                        minSubwordSize="4"
                                        maxSubwordSize="15"
                                        onlyLongestMatch="true"
                                />

                                <filter class="solr.StopFilterFactory"
words="german/stopwords.txt" ignoreCase="true"
enablePositionIncrements="true"/>
                                <filter
class="solr.SnowballPorterFilterFactory" language="German2"
protected="german/protwords.txt"/>
                                <filter
class="solr.RemoveDuplicatesTokenFilterFactory"/>
                        </analyzer>
                        <analyzer type="query">
                                <tokenizer class="solr.WhitespaceTokenizerFactory"/>

                                <filter class="solr.WordDelimiterFilterFactory"
                                        catenateWords="0"
                                        catenateNumbers="0"
                                        generateWordParts="1"
                                        splitOnCaseChange="1"
                                        generateNumberParts="1"
                                        catenateAll="0"
                                        preserveOriginal="1"
                                        splitOnNumerics="0"
                                />
                                <filter class="solr.LowerCaseFilterFactory"/>
                                <filter class="solr.StopFilterFactory"
                                        words="german/stopwords.txt" ignoreCase="true" enablePositionIncrements="true"/>
                                <filter class="solr.SnowballPorterFilterFactory"
                                        language="German2" protected="german/protwords.txt"/>
                                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                        </analyzer>
                </fieldType>
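
For anyone following along without a Solr instance at hand, here is a rough Python sketch of what the first steps of the query analyzer above seem to do to a token such as "26KA.". It is not Lucene code. The function names are made up for illustration. It assumes preserveOriginal="1" keeps the untouched token next to the delimiter-stripped part, which is what the analysis output described at the start of this thread shows. Splitting on case changes, stop words and stemming are left out on purpose.

```python
import re


def word_delimiter(token, preserve_original=True):
    """Hypothetical stand-in for the word delimiter step.

    Splits only on non-alphanumeric characters (so "26KA" stays whole,
    as with splitOnNumerics="0") and optionally keeps the original token.
    Case-change splitting is NOT modelled here.
    """
    parts = re.findall(r"[^\W_]+", token)
    out = [token] if preserve_original else []
    for part in parts:
        if part not in out:
            out.append(part)
    return out


def analyze(text):
    """Whitespace tokenizer -> word delimiter -> lowercase.

    Returns one list of tokens per position in the input.
    """
    return [[t.lower() for t in word_delimiter(tok)] for tok in text.split()]


if __name__ == "__main__":
    print(analyze("in der Serie 26KA."))
    print(analyze("26KA"))
```

Under these assumptions the last position of the indexed text carries both "26ka." and "26ka", while the query "26KA" yields only "26ka". The sketch says nothing about why the real query then fails to match.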


Thank you for taking a look.


2014-01-29 Jack Krupansky <ja...@basetechnology.com>

> What field type and analyzer/tokenizer are you using?
>
> -- Jack Krupansky

Re: Not finding part of fulltext field when word ends in dot

Posted by Jack Krupansky <ja...@basetechnology.com>.
What field type and analyzer/tokenizer are you using?

-- Jack Krupansky
