Posted to solr-user@lucene.apache.org by Malte Hübner <hu...@innobox.de> on 2014/03/27 16:25:18 UTC

WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)

I am using Solr 4.7 and have got a serious problem with
WordDelimiterFilterFactory.

WordDelimiterFilterFactory behaves differently on hyphenated terms if they
contain characters (a-Z) or characters AND numbers.



Splitting up hyphenated terms is deactivated in my configuration:



*This is the fieldType setup from my schema:*



{code}
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="0"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"
            splitOnNumerics="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de.txt"
            ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
            generateNumberParts="0" catenateWords="1" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
{code}



The given search term is: *X-002-99-495*



WordDelimiterFilterFactory indexes the following word parts:



* X-002-99-495

* X (shouldn't be there)

* 00299495 (shouldn't be there)

* X00299495



But the 'X' should not be indexed or queried as a single term. You can see
that splitting is completely deactivated in the schema.



I can move the character part around in the search term:



Searching for *002-abc-99-495* gives me



* 002-abc-99-495

* 002 (shouldn't be there)

* abc (shouldn't be there)

* 99495 (shouldn't be there)

* 002abc99495



Searching for *002-99-495* (no character) gives me

* 002-99-495

* 00299495

This result is what I would expect.



Any ideas?

Re: WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)

Posted by Erick Erickson <er...@gmail.com>.
I think you're confusing initial tokenization with post-tokenization
operations. From here (and I admit it's a little opaque):

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

bq: Splits words into subwords and performs optional transformations
on subword groups....


Splitting up by non alpha-num, case transitions etc happens _before_
any of the operations like catenateNumbers etc. So all those
parameters are really answering the question "What should we do with
all the tokens that have _already_ been broken out"
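
To make that concrete, here is a sketch that annotates the index-time filter from the original post. The two-step grouping in the comment is one reading of how the filter works, not wording taken from the docs, so treat it as an assumption and check it on the Analysis page.

{code}
<!-- Step 1 (always): the token is cut into subwords at every
     non-alphanumeric character, so X-002-99-495 becomes X | 002 | 99 | 495.
     splitOnCaseChange and splitOnNumerics only add further split points.
     Step 2: the remaining attributes choose what gets emitted:
       generateWordParts, generateNumberParts : each subword on its own
       catenateWords, catenateNumbers         : each run of adjacent word or
                                                number subwords, joined
       catenateAll                            : all subwords joined
       preserveOriginal                       : the untouched token -->
<filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="0" splitOnNumerics="0"
        generateWordParts="0" generateNumberParts="0"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        preserveOriginal="1" stemEnglishPossessive="0"/>
{code}

Under that reading, the lone X would be a one-element run of word parts emitted by catenateWords="1", and 00299495 would be the run 002, 99, 495 joined by catenateNumbers="1".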

BTW, you often get a much better sense of what actually happens if you
look at the docs for the filter rather than the factory, in this case
WordDelimiterFilter rather than WordDelimiterFilterFactory. The
latter is not where the action is, but it's what's available for
definitions in schema.xml.

Best,
Erick

On Wed, Apr 9, 2014 at 7:38 AM, Malte Hübner <hu...@innobox.de> wrote:
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>> Sent: Saturday, March 29, 2014 16:09
>> To: solr-user@lucene.apache.org
>> Subject: Re: WordDelimiterFilterFactory splits up hyphenated terms although
>> splitOnNumerics, generateWordParts and generateNumberParts are set to 0
>> (false)
>>
>> Why do you say at the indexing part:
>>
>> The given search term is: *X-002-99-495*
>> WordDelimiterFilterFactory indexes the following word parts:
>> * X (shouldn't be there)
>> * 00299495 (shouldn't be there)
>>
>> ??
>> You've set catenateNumbers="1" in your fieldType for the indexing part, so
>> WDFF is doing exactly what it should... smushing all the numbers it separated
>> into a single entity.
>
> This information is very interesting as I did not find anything about it
> in the docs.
> My understanding was that catenateNumbers would just catenate but not
> split up anything.
> Has this changed between Solr 1.4 and Solr 4?
> My current setup now is to just catenateAll.
>
> Thanks for your help.
>
>> And the whole _point_ of WDFF is to split on "non alpha nums" and index the
>> parts it splits.
>>
>> This seems like it's behaving exactly as it should.
>>
>> Or I'm missing something totally.
>>
>> Best,
>> Erick

Re: WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)

Posted by Malte Hübner <hu...@innobox.de>.
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Saturday, March 29, 2014 16:09
> To: solr-user@lucene.apache.org
> Subject: Re: WordDelimiterFilterFactory splits up hyphenated terms although
> splitOnNumerics, generateWordParts and generateNumberParts are set to 0
> (false)
>
> Why do you say at the indexing part:
>
> The given search term is: *X-002-99-495*
> WordDelimiterFilterFactory indexes the following word parts:
> * X (shouldn't be there)
> * 00299495 (shouldn't be there)
>
> ??
> You've set catenateNumbers="1" in your fieldType for the indexing part, so
> WDFF is doing exactly what it should... smushing all the numbers it separated
> into a single entity.

This information is very interesting as I did not find anything about it
in the docs.
My understanding was that catenateNumbers would just catenate but not
split up anything.
Has this changed between Solr 1.4 and Solr 4?
My current setup now is to just catenateAll.

Thanks for your help.
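
A catenateAll-only setup along those lines might look like the sketch below. The thread does not show the final configuration, so every attribute value other than catenateAll="1" is an assumption made for illustration.

{code}
<!-- Sketch only: emit the original token plus one fully catenated token.
     Values other than catenateAll are assumed, not taken from the thread. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="0" generateNumberParts="0"
        catenateWords="0" catenateNumbers="0" catenateAll="1"
        splitOnCaseChange="0" splitOnNumerics="0"
        preserveOriginal="1" stemEnglishPossessive="0"/>
{code}

The intent is that X-002-99-495 yields only X-002-99-495 and X00299495; the Analysis page in the Solr admin UI is the place to confirm that for a given version.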

> And the whole _point_ of WDFF is to split on "non alpha nums" and index the
> parts it splits.
>
> This seems like it's behaving exactly as it should.
>
> Or I'm missing something totally.
>
> Best,
> Erick

Re: WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)

Posted by Erick Erickson <er...@gmail.com>.
Why do you say at the indexing part:

The given search term is: *X-002-99-495*
WordDelimiterFilterFactory indexes the following word parts:
* X (shouldn't be there)
* 00299495 (shouldn't be there)

??
You've set catenateNumbers="1" in your fieldType for the indexing part,
so WDFF is doing exactly what it should... smushing all the numbers it
separated into a single entity.

And the whole _point_ of WDFF is to split on "non alpha nums" and
index the parts it splits.

This seems like it's behaving exactly as it should.

Or I'm missing something totally.

Best,
Erick

