Posted to solr-user@lucene.apache.org by roySolr <ro...@gmail.com> on 2011/06/10 09:51:06 UTC
WordDelimiter and stemEnglishPossessive doesn't work
Hello,
I have a problem with the WordDelimiterFilter. My data looks like this:
mcdonald's#burgerking#Free record shop#h&m
I want to tokenize this on #. After that it has to split on whitespace. I use
the WordDelimiterFilter for that (I can't use two tokenizers).
This works, but there is one problem: it removes the apostrophe. My index looks
like this:
mcdonald
burgerking
free
record
shop
h&m
I don't want this, so I use stemEnglishPossessive. The documentation for this
filter option says:
stemEnglishPossessive="1" causes trailing "'s" to be removed for each
subword.
"Doug's" => "Doug"
default is true ("1"); set to 0 to turn off
My field type looks like this:
<fieldType name="Test_field" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="#"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            catenateWords="0"/>
  </analyzer>
</fieldType>
It looks like stemEnglishPossessive="0" is not working. How can I fix this
problem? A different filter? Did I forget something?
--
View this message in context: http://lucene.472066.n3.nabble.com/WordDelimiter-and-stemEnglishPossessive-doesn-t-work-tp3047678p3047678.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: WordDelimiter and stemEnglishPossessive doesn't work
Posted by roySolr <ro...@gmail.com>.
THANK YOU!!
I thought I could only use one character for the pattern. Now I use a
regular expression :)
<tokenizer class="solr.PatternTokenizerFactory" pattern="#|\s"/>
I don't need the WordDelimiterFilter anymore; it splits on # and whitespace.
dataset: mcdonald's#burgerking#Free record shop#h&m
mcdonald's
burgerking
free
record
shop
h&m
This is exactly how we want it.
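For reference, the complete field type might look something like the sketch
below. Note the LowerCaseFilterFactory is an assumption on my part: the desired
index shows "Free" lowercased to "free", which the tokenizer alone won't do.

```xml
<!-- Sketch of the full field type. The LowerCaseFilterFactory is assumed,
     since the desired output lowercases "Free" to "free". -->
<fieldType name="Test_field" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Split on either '#' or any whitespace character -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="#|\s"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Verify the resulting token stream on the admin/analysis page.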
--
View this message in context: http://lucene.472066.n3.nabble.com/WordDelimiter-and-stemEnglishPossessive-doesn-t-work-tp3047678p3062984.html
Re: WordDelimiter and stemEnglishPossessive doesn't work
Posted by lee carroll <le...@googlemail.com>.
Do you need the WordDelimiterFilter at all?
#|\s
I think it's just a regex in the PatternTokenizer - I might be wrong though?
On 14 June 2011 11:15, roySolr <ro...@gmail.com> wrote:
> Can I configure the WordDelimiterFilter as a WhitespaceTokenizer, so it only
> splits on whitespace and nothing more (not removing 's etc.)?
Re: WordDelimiter and stemEnglishPossessive doesn't work
Posted by Erick Erickson <er...@gmail.com>.
It's a little obscure, but you can use
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceCharFilterFactory
in front of the WhitespaceTokenizer if you prefer. Note that
a CharFilterFactory is different from a FilterFactory, so
read carefully <G>..
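Something along these lines (an untested sketch): the char filter rewrites
every '#' to a space before tokenization, so the WhitespaceTokenizer does all
the splitting in one pass.

```xml
<fieldType name="Test_field" class="solr.TextField">
  <analyzer>
    <!-- Runs before the tokenizer: turn each '#' into a space -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="#" replacement=" "/>
    <!-- Now a plain whitespace split covers both cases -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```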
Best
Erick
On Tue, Jun 14, 2011 at 6:15 AM, roySolr <ro...@gmail.com> wrote:
> Can I configure the WordDelimiterFilter as a WhitespaceTokenizer, so it only
> splits on whitespace and nothing more (not removing 's etc.)?
Re: WordDelimiter and stemEnglishPossessive doesn't work
Posted by roySolr <ro...@gmail.com>.
Ok, with catenateWords the index term will be "mcdonalds", but that's not what
I want.
I only use the WordDelimiterFilter to split on whitespace. I already use the
PatternTokenizerFactory, so I can't use the WhitespaceTokenizer.
I want my index to look like this:
dataset: mcdonald's#burgerking#Free record shop#h&m
mcdonald's
burgerking
free
record
shop
h&m
Can I configure the WordDelimiterFilter as a WhitespaceTokenizer, so it only
splits on whitespace and nothing more (not removing 's etc.)?
--
View this message in context: http://lucene.472066.n3.nabble.com/WordDelimiter-and-stemEnglishPossessive-doesn-t-work-tp3047678p3062461.html
Re: WordDelimiter and stemEnglishPossessive doesn't work
Posted by Erick Erickson <er...@gmail.com>.
Hmmm, that is confusing. stemEnglishPossessive="0"
actually leaves the "s" in the index, just not attached to the
word. The admin/analysis page can help show this....
Setting it to "1" removes it from the stream entirely.
If you set catenateWords="1", you'll get "mcdonalds" in
your index if stemEnglishPossessive="0", but not if you
set stemEnglishPossessive="1".
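To illustrate the behavior described above (a sketch; the token lists in the
comments follow that description and should be verified on the
admin/analysis page):

```xml
<!-- For the input token "mcdonald's":
     stemEnglishPossessive="0": subwords "mcdonald" and "s"
       (the "s" stays in the stream, detached from the word);
       with catenateWords="1" the stream also gets "mcdonalds".
     stemEnglishPossessive="1": the trailing "'s" is removed
       entirely, leaving only "mcdonald". -->
<filter class="solr.WordDelimiterFilterFactory"
        stemEnglishPossessive="0"
        catenateWords="1"/>
```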
Hope that helps
Erick
On Fri, Jun 10, 2011 at 3:51 AM, roySolr <ro...@gmail.com> wrote:
> It looks like stemEnglishPossessive="0" is not working. How can I fix this
> problem? A different filter? Did I forget something?