Posted to solr-user@lucene.apache.org by Ashok <as...@qualcomm.com> on 2013/04/17 01:52:26 UTC

WordDelimiterFactory

Hi,

Why does WDF swallow all 'words' that start with a 'digit'?

My config is:

<filter class="solr.WordDelimiterFilterFactory" generateNumberParts="0"
splitOnNumerics="0" splitOnCaseChange="0" preserveOriginal="0"
protected="protwords.txt"/>

For some text like

20x-30y

I am expecting (& want) '20x' & '30y' to be returned & retained as the
tokens after WDF is done with it. But I get nothing as per the analysis
page.

Any idea why? I am using 4.1

Thanks

- ashok



--
View this message in context: http://lucene.472066.n3.nabble.com/WordDelimiterFactory-tp4056529.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: WordDelimiterFactory

Posted by Ashok <as...@qualcomm.com>.

Yes, thank you Erick. The analysis/document handlers hold the key to deciding
the type & order of the filters to employ, given one's document set &
subject matter. The final terms they produce for Solr search, MLT, etc.
are crucial to the quality of the results.

- ashok




Re: WordDelimiterFactory

Posted by Erick Erickson <er...@gmail.com>.

Ashok:

You really, _really_ need to dive into the admin/analysis page.
That'll show you exactly what WDFF (and all the other elements of your
chain) do to input tokens. Understanding the index and query-time
implications of all the settings in WDFF takes a while.

But from what you're describing, WDFF may not be what you're looking
for anyway. Some of the regex filters could split, for instance, on
all non-alphanumeric characters.
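
A minimal sketch of one regex-based alternative, assuming the field only
needs to break on non-alphanumeric characters. The fieldType name
text_alnum_split is hypothetical, and solr.PatternTokenizerFactory is just one
possible reading of "regex filters" here, not necessarily the component meant
above:

```xml
<fieldType name="text_alnum_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split the input on runs of characters that are not ASCII letters or digits.
         With no "group" attribute the pattern is used as a delimiter (split mode). -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^A-Za-z0-9]+"/>
  </analyzer>
</fieldType>
```

With an analyzer like this, input such as 20x-30y should come out as the two
tokens 20x and 30y. The pattern only covers ASCII letters and digits, so it
would need adjusting for other scripts, and the analysis page remains the way
to confirm the result on a given Solr version.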

Best
Erick


Re: WordDelimiterFactory

Posted by Shawn Heisey <so...@elyograg.org>.

On 4/16/2013 8:12 PM, Ashok wrote:
> It looks like any 'word' that starts with a digit is treated as a numeric
> string.
> 
> Setting generateNumberParts="1" instead of "0" seems to generate the right
> tokens in this case, but I need to see if it has any other impacts on the
> finalized token list...

I have a fieldType that is using WDF with the following settings on the
index side.  Both index and query analysis show it behaving correctly
with terms that start with numbers, on versions 4.2.1 and 3.5.0:

        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />

It has different settings on the query side, but generateNumberParts is
1 for both:

        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="0"
        />

I haven't tried it with generateNumberParts set to 0.

Thanks,
Shawn

Re: WordDelimiterFactory

Posted by Ashok <as...@qualcomm.com>.

Thank you Jack, yes it is tricky.

If my text is

x20-y30

I get two nice tokens x20 & y30 that I need to keep.

But the text 20x-30y is treated differently and I get nothing.

20x-y30 gives me just 'y30'

The docs on LucidWorks say

generateNumberParts: (integer, default 1) If non-zero, splits numeric
strings at delimiters: "1947-32" -> "1947", "32"

It looks like any 'word' that starts with a digit is treated as a numeric
string.

Setting generateNumberParts="1" instead of "0" seems to generate the right
tokens in this case, but I need to see if it has any other impacts on the
finalized token list...
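
For reference, the change described above would look roughly like this. It is
the filter definition from the original question with only generateNumberParts
flipped; every other attribute is unchanged, and the effect on other inputs
still needs checking on the analysis page:

```xml
<filter class="solr.WordDelimiterFilterFactory" generateNumberParts="1"
        splitOnNumerics="0" splitOnCaseChange="0" preserveOriginal="0"
        protected="protwords.txt"/>
```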

Thanks

- ashok






Re: WordDelimiterFactory

Posted by Jack Krupansky <ja...@basetechnology.com>.

Because you told it to!!! With: generateNumberParts="0"

WDF is tricky... tell us exactly what rules you want it to follow and then 
we can tell you how to set the options.

Maybe more to the point: why exactly do you think you want to use WDF? Not 
that there aren't good reasons, but what specifically are yours?

Generally, see the schema in the Solr example for suggested best practices. 
Copy and paste from there, or, better yet, use exactly the types that are 
there.

-- Jack Krupansky
