Posted to solr-user@lucene.apache.org by Aleksander Akerø <al...@gurusoft.no> on 2014/01/29 14:49:18 UTC

KeywordTokenizerFactory with whitespace

Hi

According to the Solr documentation, solr.KeywordTokenizerFactory should not
do any tokenizing at all, but to me it seems to be splitting on whitespace,
e.g. a space.

For example, I have the value "FE 009" stored in the index in the field
"number", and what I search for is the exact same string "FE 009" (without
quotes). But it returns results like "EE 009", "ED 009" and similar
ones. Why is that?

I'm using the extended DisMax query parser, and "number" is the only
defined field in the qf parameter.


I want exact matches, but need to ignore case. Hence the use of
"solr.LowerCaseFilterFactory", and that is why I do not use the default
"string" fieldType.

This is the fieldType definition:
        <fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.KeywordTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <analyzer type="index">
                <tokenizer class="solr.KeywordTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
        </fieldType>

and this is the field:
        <field name="number" type="keyword" indexed="true" stored="true" required="false" />

Later, if I get this to work, I would also like to add
"solr.EdgeNGramFilterFactory" to support trailing or leading wildcard matches,
e.g. return "FE 009-1" and "FE 009-2" as well as "FE 009" when searching for
"FE 009". Would this be a way to do it?

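The prefix-matching idea above could look roughly like the following. This is a sketch, not a tested configuration: the type name "keyword_prefix" and the gram sizes are made up for illustration. It only covers trailing-wildcard style matches ("FE 009" finding "FE 009-1"), and it assumes the whole search string reaches the query analyzer as a single token, for instance when it is sent as a quoted phrase.

```xml
<!-- Hypothetical type name "keyword_prefix"; gram sizes are illustrative only -->
<fieldType name="keyword_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- "fe 009-1" is indexed as its prefixes: "f", "fe", "fe ", ... "fe 009", ... "fe 009-1" -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
        <!-- No n-grams at query time: the lowercased query must equal one indexed prefix -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

Applying the n-gram filter at index time only keeps a short query such as "FE 009" from being expanded into one-letter prefixes that would match almost everything.
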
*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: aleksander@gurusoft.no

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no

Re: KeywordTokenizerFactory with whitespace

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Well, while you are preparing that one, is there any reason you have
two analyzers and both are 'index' type? One would probably be query
type, no?
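
For illustration, the posted definition with the second analyzer marked as the query analyzer might read as follows. This is a sketch based only on the fieldType quoted in the original message; nothing else is changed.

```xml
<!-- Sketch: the posted "keyword" type, with the second analyzer switched to type="query" -->
<fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

Since both chains are identical here, a single <analyzer> element without a type attribute would apply the same analysis at index and query time.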

Regards,
  Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Jan 29, 2014 at 8:50 PM, Aleksander Akerø
<al...@gurusoft.no> wrote:
> Sorry guys, please ignore this. It was not ready to be sent but got sent
> mistakenly. Will send a proper one later on.

Re: KeywordTokenizerFactory with whitespace

Posted by Aleksander Akerø <al...@gurusoft.no>.

Sorry guys, please ignore this. It was not ready to be sent but got sent
mistakenly. Will send a proper one later on.


