Posted to solr-user@lucene.apache.org by roySolr <ro...@gmail.com> on 2011/06/10 09:51:06 UTC

WordDelimiter and stemEnglishPossessive doesn't work

Hello,

I have a problem with the WordDelimiterFilter. My data looks like this:

mcdonald's#burgerking#Free record shop#h&m

I want to tokenize this on #, and after that split on whitespace. I use the
WordDelimiterFilter for that (I can't use two tokenizers).

This works, but there is one problem: it strips the possessive 's. My index
looks like this:

mcdonald
burgerking
free
record
shop
h&m

I don't want this, so I looked at stemEnglishPossessive. The description of
this filter option reads:

stemEnglishPossessive="1" causes trailing "'s" to be removed for each
subword.
    "Doug's" => "Doug"
    default is true ("1"); set to 0 to turn off 

My Field looks like this:

<fieldType name="Test_field" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="#" />
    <filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="0"
        splitOnNumerics="0"
        stemEnglishPossessive="0"
        catenateWords="0"
    />
  </analyzer>
</fieldType>

It looks like stemEnglishPossessive="0" is not working. How can I fix this
problem? A different filter? Did I forget something?

--
View this message in context: http://lucene.472066.n3.nabble.com/WordDelimiter-and-stemEnglishPossessive-doesn-t-work-tp3047678p3047678.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: WordDelimiter and stemEnglishPossessive doesn't work

Posted by roySolr <ro...@gmail.com>.
THANK YOU!!

I thought I could only use a single character for the pattern. Now I use a
regular expression :)

<tokenizer class="solr.PatternTokenizerFactory" pattern="#|\s" />

I don't need the WordDelimiterFilter anymore. It splits on # and on whitespace.

dataset: mcdonald's#burgerking#Free record shop#h&m

mcdonald's
burgerking
free
record
shop
h&m

This is exactly how we want it.
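For reference, the whole field type can then drop the WordDelimiterFilter entirely. A sketch based on the field from the first post — the LowerCaseFilterFactory and the `\s+` pattern are my additions: lowercasing because the desired index terms are lowercase, and `\s+` so a run of spaces does not produce empty tokens:

```xml
<fieldType name="Test_field" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- split on '#' or on any run of whitespace in a single pass -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="#|\s+"/>
    <!-- assumption: lowercase the tokens, since the desired terms are lowercase -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The admin/analysis page is the quickest way to confirm the resulting token stream.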


Re: WordDelimiter and stemEnglishPossessive doesn't work

Posted by lee carroll <le...@googlemail.com>.
Do you need the WordDelimiterFilter at all?

#|\s

I think it's just a regex in the pattern tokenizer, though I might be wrong.

Re: WordDelimiter and stemEnglishPossessive doesn't work

Posted by Erick Erickson <er...@gmail.com>.
It's a little obscure, but you can use
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceCharFilterFactory

in front of the WhitespaceTokenizer if you prefer. Note that
a CharFilterFactory is different from a FilterFactory, so
read carefully <G>.
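A sketch of that approach, reusing the field from earlier in the thread (treat it as a starting point, not a tested config). The char filter rewrites every '#' to a space before tokenization, so a plain WhitespaceTokenizer then handles both delimiters:

```xml
<fieldType name="Test_field" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- replace each '#' with a space before the tokenizer sees the text -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="#" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```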

Best
Erick


Re: WordDelimiter and stemEnglishPossessive doesn't work

Posted by roySolr <ro...@gmail.com>.
Ok, with catenateWords the index term will be mcdonalds. But that's not what
I want.

I only use the WordDelimiterFilter to split on whitespace. I already use the
PatternTokenizerFactory, so I can't use the WhitespaceTokenizer.

I want my index to look like this:

dataset: mcdonald's#burgerking#Free record shop#h&m

mcdonald's
burgerking
free
record
shop
h&m

Can I configure the WordDelimiterFilter as a WhitespaceTokenizer, so that it
only splits on whitespace and nothing more (no removing the 's etc.)?


Re: WordDelimiter and stemEnglishPossessive doesn't work

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, that is confusing. stemEnglishPossessive="0"
actually leaves the 's' in the index, just not attached to the
word. The admin/analysis page can help show this.

Setting it to 1 removes it entirely from the stream.

If you set catenateWords="1", you'll get "mcdonalds" in
your index if stemEnglishPossessive="0", but not if you
set stemEnglishPossessive="1".
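A sketch of that combination — only catenateWords changed from the original field; checking it on the admin/analysis page should show the split parts plus the catenated term:

```xml
<filter class="solr.WordDelimiterFilterFactory"
    splitOnCaseChange="0"
    splitOnNumerics="0"
    stemEnglishPossessive="0"
    catenateWords="1"/>
<!-- "mcdonald's" -> tokens: mcdonald, s, mcdonalds -->
```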

Hope that helps
Erick
