Posted to solr-user@lucene.apache.org by Aleksander Akerø <al...@gurusoft.no> on 2014/01/29 15:05:08 UTC

KeywordTokenizerFactory - trouble with "exact" matches

Hi, I'll try properly this time.

According to the Solr documentation, solr.KeywordTokenizerFactory should not
do any tokenizing at all. Thus, if I understand this correctly, a field should
only return exact matches when this is the only tokenizer defined in its
field type, as in the following config:

Fieldtypes:
        <fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.KeywordTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.KeywordTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
        </fieldType>

Fields:
        <field name="number" type="keyword" indexed="true" stored="true" required="false" />

But it does not seem to work this way for me. In the index I have values like
"FE 009", "EE 009", "ED 009" and "FE 009-1" (without the quotes, of course). But
when I search for FE 009 (without quotes), I get no results. It seems that I
have to add quotes to the search query in order to retrieve any results, but
that won't work for me, as I will later have to expand the index with other
fields that need whitespace tokenization and such. Or would that work
regardless of quotes? I have come to understand that wrapping the query in
quotes forces it to be analyzed as one token, no matter what.
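
One way to picture the behaviour described here is a small Python model. It is
a sketch only, not Solr code. It assumes, as the replies in this thread suggest,
that the query parser splits unquoted input on whitespace before the field's
analyzer sees it. The helper names keyword_analyze, parse_query and matches are
hypothetical.

```python
import re

def keyword_analyze(text):
    """Mimic KeywordTokenizerFactory + LowerCaseFilterFactory: one lowercased token."""
    return text.lower()

def parse_query(query):
    """Toy parser: quoted phrases stay whole, everything else is split on whitespace."""
    chunks = re.findall(r'"([^"]*)"|(\S+)', query)
    return [phrase or word for phrase, word in chunks]

def matches(query, indexed_values):
    """Return indexed values whose single keyword token equals an analyzed query chunk."""
    index = {keyword_analyze(v): v for v in indexed_values}
    terms = [keyword_analyze(chunk) for chunk in parse_query(query)]
    return [index[t] for t in terms if t in index]

values = ["FE 009", "EE 009", "ED 009", "FE 009-1"]
print(matches('FE 009', values))    # chunks "fe" and "009" equal no indexed token
print(matches('"FE 009"', values))  # the phrase reaches the analyzer as one string
```

In this model the unquoted query finds nothing because neither "fe" nor "009"
equals the single indexed token "fe 009", while the quoted query does.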

If I get this to work I would also like to add the
"solr.EdgeNGramFilterFactory" to the index side analyzer, thus adding
trailing wildcard matches. E.g. return "FE 009-1", "FE 009-2" as well as
"FE 009" when searching for "FE 009", but not "EE 009", and "ED 009". Would
that be an ok way to do it?
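
As a rough illustration of the idea, the Python sketch below models front edge
n-grams applied to a single lowercased keyword token at index time, with no
n-gramming on the query side. The gram sizes (2 and 25) and the helper names
are assumptions for illustration, and the model presumes the query reaches the
analyzer as one token.

```python
def edge_ngrams(token, min_gram=2, max_gram=25):
    """Front edge n-grams of a single token, shortest first."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

def index_terms(value):
    """Keyword token, lowercased, then expanded into front edge n-grams."""
    return set(edge_ngrams(value.lower()))

def prefix_matches(query, indexed_values):
    """Query side does no n-gramming: the whole lowercased query must equal a gram."""
    term = query.lower()
    return [v for v in indexed_values if term in index_terms(v)]

values = ["FE 009", "EE 009", "ED 009", "FE 009-1", "FE 009-2"]
print(prefix_matches("FE 009", values))
```

Under these assumptions "FE 009", "FE 009-1" and "FE 009-2" match, while
"EE 009" and "ED 009" do not, because none of their grams starts with "fe".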

*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: aleksander@gurusoft.no

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no

Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Srinivasa7 <sr...@googlemail.com>.

Aleksander Akerø,
It would be great if you could share how you are handling this on a
per-field basis.



--
View this message in context: http://lucene.472066.n3.nabble.com/KeywordTokenizerFactory-trouble-with-exact-matches-tp4114193p4114435.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Aleksander Akerø <al...@gurusoft.no>.

I've come across something like this as well, can't remember where, but it
was often related to synonym functionality.

The following link shows a 3rd party QueryParser that seems to deal with
synonyms alongside edismax, and may be interesting to look at:
http://wiki.apache.org/solr/QueryParser

It is also mentioned as an issue when using the SynonymFilterFactory:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
"The Lucene QueryParser tokenizes on white space before giving any text to
the Analyzer, so if a person searches for the words sea biscit the analyzer
will be given the words "sea" and "biscit" seperately, and will not know
that they match a synonym".

Maybe the extended support for synonym handling is what will give us the
solution one day. For now I have solved my problem and will leave it at
that.


Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Jack Krupansky <ja...@basetechnology.com>.

I vaguely recall that there was a Jira floating around for multi-word 
synonyms that dealt with parsing of spaces as well. And Robert Muir has 
(repeatedly) referred to this query parser feature as a "bug". Somehow, 
eventually, I think it will be dealt with, but the "difficulty" remains for 
now.

-- Jack Krupansky


Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Aleksander Akerø <al...@gurusoft.no>.

Yes, I actually noted that about the filter vs. tokenizer. It's easy to get
confused if you don't have a good understanding of the differences between
tokenizers and filters.

As for the query parser problem, there's always a workaround, but it was
nice to be made aware of. It sort of was a ghost-like problem before.
Allthough it would be great to have the opportunity to "disable" the
splitting on whitespace even for DisMax, I understand that it probably not
the most wanted feature for next solr release :)


Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Erick Erickson <er...@gmail.com>.

Note, the comments about LowerCaseTokenizer were a red herring. You were
using LowerCaseFilterFactory; note "Filter" rather than "Tokenizer". So it would
just do what you expected: lowercase the entire input. You would have used
LowerCaseTokenizerFactory in place of KeywordTokenizerFactory, not as a filter.
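
The difference can be sketched in Python. This is a simplified model rather
than Lucene code: it assumes that LowerCaseTokenizerFactory splits the input at
non-letter characters and lowercases the pieces, while KeywordTokenizerFactory
followed by LowerCaseFilterFactory keeps the whole input as one lowercased
token. The function names are hypothetical.

```python
def lowercase_tokenizer(text):
    """Rough model of LowerCaseTokenizerFactory: split at non-letters, lowercase."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

def keyword_then_lowercase(text):
    """KeywordTokenizerFactory followed by LowerCaseFilterFactory: one token."""
    return [text.lower()]

print(lowercase_tokenizer("FE 009-1"))     # letters only, digits are dropped
print(keyword_then_lowercase("FE 009-1"))  # whole input kept as one token
```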

As for the rest, I expect Jack is right: it's the query parsing that sits
above the field analysis.

Best
Erick

On Thu, Jan 30, 2014 at 6:29 AM, Aleksander Akerø
<al...@gurusoft.no> wrote:
> Hi Srinivasa
>
> Yes, I've come to understand that the analyzers will never "see" the
> whitespace, so there is no need for pattern replacement, as Jack points out.
> So the solution would be to set which parser to use for the query. Jack has
> also pointed out that the "field" query parser should work in this particular
> setting -> http://wiki.apache.org/solr/QueryParser
>
> My problem, though, was that I needed this for only one of the fields in the
> schema. For all the other fields, e.g. name, description etc., I would very
> much like to make use of the eDisMax functionality. And it seems that only
> one query parser can be defined per query, in other words: for all fields.
> Jack, you may correct me if I'm wrong here :)
>
> This particular customer wanted a wildcard search at both ends of the
> phrase, and that sort of complicated the problem. I therefore chose to
> replace all whitespace for this field in SQL at index time, using the DIH,
> and then to use EdgeNGramFilterFactory on both sides of the keyword, as in the
> config below. That seemed to work pretty nicely.
>
> <!-- #### WildCard search number #### -->
> <fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
>     <analyzer type="index">
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="front"/>
>         <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="back"/>
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
> </fieldType>
>
> I also added a bit of extra weighting for the "keyword" field so that exact
> matches received a higher score.
>
> What this solution doesn't do is exclude values like "EE 009" when
> searching for "FE 009", but they are returned far down the list, which is ok
> for the customer, because these results are usually somewhat related and
> within the same category.
>
> 2014-01-30 Jack Krupansky <ja...@basetechnology.com>
>
>> The standard, keyword-oriented query parsers will all treat unquoted,
>> unescaped white space as term delimiters and ignore the white space. There
>> is no way to bypass that behavior. So, your regex will never even see the
>> white space - unless you enclose the text and white space in quotes or use
>> a backslash to quote each white space character.
>>
>> You can use the "field" and "term" query parsers to pass a query string as
>> if it were fully enclosed in quotes, but that only handles a single term
>> and does not allow for multiple terms or any query operators. For example:
>>
>> {!field f=myfield}Foo Bar
>>
>> See:
>> http://wiki.apache.org/solr/QueryParser
>>
>> You can also pre-configure the field query parser with the defType=field
>> parameter.
>>
>> -- Jack Krupansky
>>
>>
>> -----Original Message----- From: Srinivasa7
>> Sent: Thursday, January 30, 2014 6:37 AM
>>
>> To: solr-user@lucene.apache.org
>> Subject: Re: KeywordTokenizerFactory - trouble with "exact" matches
>>
>> Hi,
>>
>> I have a similar kind of problem, where I want to search for words with
>> spaces in them, and I want to search by stripping all the spaces.
>>
>> I have used the following schema for that:
>>
>> <fieldType name="nospaces" class="solr.TextField" autoGeneratePhraseQueries="true">
>>     <analyzer type="index">
>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>         <filter class="solr.PatternReplaceFilterFactory" pattern="[^\w]+" replacement="" replace="all"/>
>>     </analyzer>
>>     <analyzer type="query">
>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>         <filter class="solr.PatternReplaceFilterFactory" pattern="[^\w]+" replacement="" replace="all"/>
>>     </analyzer>
>> </fieldType>
>>
>>
>> And
>>
>>
>> <field name="text_nospaces" type="nospaces" indexed="true" stored="true" omitNorms="true" />
>> <copyField source="text" dest="text_nospaces" />
>>
>>
>>
>> But it is not finding the right terms. We strip the spaces and index
>> lowercase values when we do that.
>>
>> For example: East Enders
>>
>> When I search for the text 'east end ers', it returns no values, saying
>> no document found.
>>
>> I realised that Solr runs the QueryParser before passing the query string
>> to the query analyzer defined in the schema.
>>
>> And the query parser tokenizes the query string provided in the query, so
>> it sends each token separately to the query analyzer defined in the schema.
>>
>> So is there any way that I can bypass this query parser, or use a query
>> processor that treats the entire string as a single phrase?
>>
>> At the moment I am using the dismax query processor.
>>
>> Any suggestion would be much appreciated.
>>
>> Thanks
>> Srinivasa
>>
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/
>> KeywordTokenizerFactory-trouble-with-exact-matches-tp4114193p4114432.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>

Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Aleksander Akerø <al...@gurusoft.no>.

Hi Srinivasa

Yes, I've come to understand that the analyzers will never "see" the
whitespace, so there is no need for pattern replacement, as Jack points out.
The solution would be to choose which parser to use for the query. Jack has
also pointed out that the "field" query parser should work in this particular
setting -> http://wiki.apache.org/solr/QueryParser

My problem, though, was that I needed this for only one of the fields in the
schema. For all the other fields, e.g. name, description etc., I would very
much like to make use of the eDisMax functionality. And it seems that only
one query parser can be defined per query, in other words, for all fields.
Jack, you may correct me if I'm wrong here :)

This particular customer wanted a wildcard search at both ends of the
phrase, which complicated the problem. I therefore chose to strip all
whitespace from this field in SQL at index time, using the DIH, and then to
apply EdgeNGramFilterFactory on both sides of the keyword, as in the config
below. That seemed to work pretty nicely.

<!-- #### WildCard search number #### -->
<fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="front"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="back"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

I also added a bit of extra weighting for the "keyword" field so that exact
matches received a higher score.

What this solution doesn't do is exclude values like "EE 009" when
searching for "FE 009", but they come far down the list, which is OK for the
customer, because these results are usually somewhat related or
within the same category.
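
One possible reading of why "EE 009" still shows up for "FE 009" can be
sketched in Python. This is a toy model, not Solr code. It assumes that the
two chained EdgeNGram filters behave as modeled here (the "back" filter is
applied to every gram emitted by the "front" filter) and that the query is
still split on whitespace before analysis. The helper names edge_ngrams,
index_terms, query_terms and matches are hypothetical.

```python
def edge_ngrams(token, min_size, max_size, side):
    """Return the edge n-grams of a token, taken from the front or the back."""
    grams = []
    for n in range(min_size, min(max_size, len(token)) + 1):
        grams.append(token[:n] if side == "front" else token[-n:])
    return grams


def index_terms(value):
    """Toy index-side chain: strip spaces (done in SQL), lowercase, front grams, back grams."""
    token = value.replace(" ", "").lower()
    terms = set()
    for front_gram in edge_ngrams(token, 2, 25, "front"):
        terms.update(edge_ngrams(front_gram, 2, 25, "back"))
    return terms


def query_terms(query):
    """Assumption: the parser splits on whitespace, then each chunk is lowercased."""
    return [chunk.lower() for chunk in query.split()]


def matches(query, indexed_value):
    """List the query terms that hit at least one indexed gram of the value."""
    grams = index_terms(indexed_value)
    return [term for term in query_terms(query) if term in grams]


print(matches("FE 009", "FE 009"))  # ['fe', '009']
print(matches("FE 009", "EE 009"))  # ['009']
print(matches("FE009", "EE 009"))   # []
```

Under these assumptions "EE 009" matches only through the "009" chunk, which
would be consistent with it ranking below the values that match both chunks.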

*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: aleksander@gurusoft.no

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no



Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Jack Krupansky <ja...@basetechnology.com>.

The standard, keyword-oriented query parsers will all treat unquoted,
unescaped white space as term delimiters and ignore the white space. There is
no way to bypass that behavior. So, your regex will never even see the white
space - unless you enclose the text and white space in quotes or use a
backslash to quote each white space character.

You can use the "field" and "term" query parsers to pass a query string as 
if it were fully enclosed in quotes, but that only handles a single term and 
does not allow for multiple terms or any query operators. For example:

{!field f=myfield}Foo Bar

See:
http://wiki.apache.org/solr/QueryParser

You can also pre-configure the field query parser with the defType=field 
parameter.
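
The difference described above can be sketched with a toy model in Python.
This is not Solr's parser; split_query and field_parser are hypothetical
helpers that only mimic the whitespace handling: unquoted, unescaped white
space delimits terms, while quotes, backslash escapes, or a "field"-style
parser keep the whole string together.

```python
def split_query(query):
    """Split on unquoted, unescaped white space (toy model of a keyword-oriented parser)."""
    terms, buf, in_quotes, i = [], [], False, 0
    while i < len(query):
        ch = query[i]
        if ch == "\\" and i + 1 < len(query):
            buf.append(query[i + 1])  # escaped character is kept literally
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes  # quotes group text but are not part of the term
        elif ch.isspace() and not in_quotes:
            if buf:  # white space ends the current term and is then dropped
                terms.append("".join(buf))
                buf = []
        else:
            buf.append(ch)
        i += 1
    if buf:
        terms.append("".join(buf))
    return terms


def field_parser(query):
    """Toy model of the "field" parser: the whole string is handed over as one value."""
    return [query]


print(split_query('FE 009'))     # ['FE', '009']
print(split_query('"FE 009"'))   # ['FE 009']
print(split_query('FE\\ 009'))   # ['FE 009']
print(field_parser('FE 009'))    # ['FE 009']
```

Only the last three forms hand "FE 009" to the field's analyzer in one piece,
which is what a KeywordTokenizerFactory field needs in order to match.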

-- Jack Krupansky



Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Srinivasa7 <sr...@googlemail.com>.

Hi,

I have a similar kind of problem, where I want to search for words with
spaces in them, and I want the search to work with all the spaces stripped.

I have used the following schema for that:

<fieldType name="nospaces" class="solr.TextField" autoGeneratePhraseQueries="true">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="[^\w]+" replacement="" replace="all"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="[^\w]+" replacement="" replace="all"/>
    </analyzer>
</fieldType>


And

<field name="text_nospaces" type="nospaces" indexed="true" stored="true" omitNorms="true"/>
<copyField source="text" dest="text_nospaces"/>



But it is not finding the right terms. We strip the spaces and index
lowercase values when we do that.

For example: East Enders

When I search for the text 'east end ers', it returns nothing, saying
no document found.

I realised that Solr runs the query parser before passing the query string
to the query analyzer defined in the schema.

The query parser tokenizes the query string provided in the query, so
it sends each token separately to the query analyzer defined in the schema.
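
The behaviour described above can be reproduced with a small Python sketch.
It is a toy model only, assuming the parser splits on whitespace before the
field's analyzer runs; nospaces_analyze and search are hypothetical names
that mimic the "nospaces" analyzer chain (keyword tokenizer, lowercase,
pattern replace).

```python
import re


def nospaces_analyze(text):
    """Whole input is one token; lowercase it, then drop every run of non-word characters."""
    return re.sub(r"[^\w]+", "", text.lower())


# Index side: the analyzer sees the complete field value.
indexed = {nospaces_analyze("East Enders")}  # {'eastenders'}


def search(query, quoted=False):
    """Unquoted queries are split on whitespace first; each chunk is analyzed on its own."""
    chunks = [query] if quoted else query.split()
    return any(nospaces_analyze(chunk) in indexed for chunk in chunks)


print(search("east end ers"))               # False: 'east', 'end', 'ers' never equal 'eastenders'
print(search("east end ers", quoted=True))  # True: the whole string becomes 'eastenders'
```

In this model the pattern replace never gets a chance to join the chunks,
because each chunk reaches the analyzer separately.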


So is there any way that I can bypass this query parser, or use a query
parser that treats the entire string as a single phrase?

At the moment I am using the dismax query parser.

Any suggestion would be much appreciated.

Thanks 
Srinivasa



--
View this message in context: http://lucene.472066.n3.nabble.com/KeywordTokenizerFactory-trouble-with-exact-matches-tp4114193p4114432.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Aleksander Akerø <al...@gurusoft.no>.

Tried the following config for setting the autoGeneratePhraseQueries but it
didn't seem to change anything. Tested both "true" and "false".

<fieldType name="keyword" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>

Still I do not get any matches when searching for "FE 009" without quotes.

Set debugQuery to "on" and this is what it shows. Definitely looks like it
does this MultiPhraseQuery thing.
<lst name="debug">
<str name="rawquerystring">FE 009</str>
<str name="querystring">FE 009</str>
<str name="parsedquery">
(+(DisjunctionMaxQuery((number:FE))
DisjunctionMaxQuery((number:009))))/no_coord
</str>
<str name="parsedquery_toString">+((number:FE) (number:009))</str>
<lst name="explain"/>
<str name="QParser">ExtendedDismaxQParser</str>

I also looked into these query parsers, but it looks like the splitting on
whitespace is done by the dismax query parser before the terms are passed to
any analyzers. And it is vital to me that I can control this on a per-field
basis.

*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: aleksander@gurusoft.no

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no



Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Aleksander Akerø <al...@gurusoft.no>.

Thanks a lot, I'll try the autoGeneratePhraseQueries property and see how
that works.

Regarding the reindexing tip: it's a good one, but with my current "on the
fly" setup on the servers at work I basically have to build a project with
Maven and deploy it to Tomcat, where the index lives, so I have to reindex
each time anyway; otherwise the index would be empty. Also, I usually use
the "clean" parameter when testing with DIH. So that shouldn't be a problem.

*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: aleksander@gurusoft.no

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no



Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

I think the whitespace might also be the issue. The query gets parsed
by the standard component, which splits it on spaces before passing the
individual parts into the field searches.

Try enabling autoGeneratePhraseQueries on the field (or field type)
and reindexing. See if that makes a difference.

Regards,
  Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Jan 29, 2014 at 9:55 PM, Aleksander Akerø
<al...@gurusoft.no> wrote:
> update:
>
> Guessing that this has nothing to do with the tokenizer. Tried to use the
> string fieldtype as well, but still the same results. So this must have to
> do with some other solr config.
>
> What confuses me is that when I search "1005" which is another valid value
> to search for, it works perfectly, but then again, this query contains no
> whitespace.
>
> Any ideas?
>
> *Aleksander Akerø*
> Systemkonsulent
> Mobil: 944 89 054
> E-post: aleksander@gurusoft.no
>
> *Gurusoft AS*
> Telefon: 92 44 09 99
> Østre Kullerød
> www.gurusoft.no
>
>
> 2014-01-29 Aleksander Akerø <al...@gurusoft.no>
>
>> Thanks for the quick answer, but it doesn't help if I remove the lowercase
>> analyzer like so:
>>
>> *        <fieldType name="keyword" class="solr.TextField"
>> positionIncrementGap="100">*
>> *            <analyzer type="index">*
>> *                <tokenizer class="solr.KeywordTokenizerFactory"/>*
>> *            </analyzer>*
>> *            <analyzer type="query">*
>> *                <tokenizer class="solr.KeywordTokenizerFactory"/>*
>> *            </analyzer>*
>> *        </fieldType>*
>>
>>  I still need to add quotes to the searchquery to get results. And the
>> weird thing is that if I use the analyzer and put in "FE 009" (again,
>> without quotes) for both index and query values, it highlights the result
>> as to show a match, but when i search using the GUI it gives me no results.
>> The same happens when posting directly to the /select requestHandler via GET
>>
>> These is what i post using GET:
>> http://mysite.com/solr/corename/select?q=number:FE%20009&qf=number    =>
>> this does not work
>> http://mysite.com/solr/corename/select?q=number:"FE%20009"&qf=number  =>
>> this works
>>
>> Really starting to wonder if I am doing something terribly wrong somewhere.
>>
>> This is my requestHandler btw, pretty basic:
>> <!-- #### Default handler #### -->
>>     <requestHandler name="/select" class="solr.SearchHandler">
>>         <lst name="defaults">
>>             <str name="echoParams">explicit</str>
>>             <str name="defType">edismax</str>
>>             <str name="q.alt">*:*</str>
>>             <str name="rows">10</str>
>>             <str name="fl">*,score</str>
>>             <str name="qf">number</str>
>>         </lst>
>>     </requestHandler>
>>

Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Jack Krupansky <ja...@basetechnology.com>.

If you change the analyzer for a Solr field, for example by adding, removing,
or changing attributes of token filters, you must/should reindex all data
(add it to the index again so that it is re-analyzed). In your case, the data
was indexed as lower case, so after your changes a query with upper case
would not match.
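
The reindexing point can be illustrated with a toy model. This is only a sketch in plain Python and does not call Solr; the functions `keyword_lowercase`, `keyword_only`, `build_index` and `search` are hypothetical stand-ins for an analysis chain and an inverted index.

```python
# Toy model of an index built with one analysis chain and queried with another.
# This only simulates the idea; it does not call Solr.

def keyword_lowercase(text):
    """Stand-in for KeywordTokenizer + LowerCaseFilter: one token, lowercased."""
    return [text.lower()]

def keyword_only(text):
    """Stand-in for KeywordTokenizer alone: one token, unchanged."""
    return [text]

def build_index(values, analyzer):
    """Map each indexed term to the set of stored values that produced it."""
    index = {}
    for value in values:
        for term in analyzer(value):
            index.setdefault(term, set()).add(value)
    return index

def search(index, query, analyzer):
    """Look up every query-side term in the index and collect the hits."""
    hits = set()
    for term in analyzer(query):
        hits |= index.get(term, set())
    return hits

values = ["FE 009", "EE 009", "ED 009", "FE 009-1"]

# Documents were indexed while the lowercase filter was still configured.
stale_index = build_index(values, keyword_lowercase)

# The schema is then changed, but the data is not reindexed.
print(search(stale_index, "FE 009", keyword_only))   # set()

# After reindexing with the new chain, index and query terms agree again.
fresh_index = build_index(values, keyword_only)
print(search(fresh_index, "FE 009", keyword_only))   # {'FE 009'}
```

The stale index still holds "fe 009", so the unchanged-case query term "FE 009" finds nothing until the data is indexed again.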

-- Jack Krupansky

-----Original Message----- 
From: Aleksander Akerø
Sent: Wednesday, January 29, 2014 9:55 AM
To: solr-user@lucene.apache.org
Subject: Re: KeywordTokenizerFactory - trouble with "exact" matches

update:

Guessing that this has nothing to do with the tokenizer. Tried to use the
string fieldtype as well, but still the same results. So this must have to
do with some other solr config.

What confuses me is that when I search "1005" which is another valid value
to search for, it works perfectly, but then again, this query contains no
whitespace.

Any ideas?


Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Aleksander Akerø <al...@gurusoft.no>.

update:

I'm guessing that this has nothing to do with the tokenizer. I tried the
string field type as well and got the same results, so this must have to do
with some other Solr config.

What confuses me is that when I search "1005" which is another valid value
to search for, it works perfectly, but then again, this query contains no
whitespace.

Any ideas?
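
One possible explanation, offered as an assumption rather than a diagnosis: the standard and edismax query parsers of that era split the query string on unquoted whitespace before the field's analyzer ever sees it, so "FE 009" reaches the keyword field as two separate terms while "1005" arrives whole. The sketch below models that in plain Python. `split_clauses`, `keyword_analyzer` and `matches` are hypothetical helpers, not Solr APIs, and the bare terms are assumed to be searched against the number field.

```python
import re

def split_clauses(query):
    """Split on whitespace, keeping quoted phrases and backslash-escaped
    spaces together. A rough stand-in for a whitespace-splitting parser."""
    pattern = r'"[^"]*"|(?:\\.|[^\s"])+'
    clauses = []
    for match in re.finditer(pattern, query):
        text = match.group(0)
        if text.startswith('"'):
            clauses.append(text[1:-1])                      # drop the quotes
        else:
            clauses.append(re.sub(r'\\(.)', r'\1', text))   # drop the escapes
    return clauses

def keyword_analyzer(text):
    """Stand-in for KeywordTokenizer: the whole input is one token."""
    return [text]

indexed_terms = {"FE 009", "EE 009", "ED 009", "FE 009-1", "1005"}

def matches(query):
    """Return the indexed terms hit by the query after splitting and analysis."""
    terms = [t for clause in split_clauses(query) for t in keyword_analyzer(clause)]
    return sorted(t for t in terms if t in indexed_terms)

print(matches('1005'))        # ['1005']
print(matches('FE 009'))      # []
print(matches('"FE 009"'))    # ['FE 009']
print(matches(r'FE\ 009'))    # ['FE 009']
```

In this model the unquoted query becomes the terms "FE" and "009", neither of which exists in an index that holds "FE 009" as a single token.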

*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: aleksander@gurusoft.no

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no



Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Aleksander Akerø <al...@gurusoft.no>.

Thanks for the quick answer, but it doesn't help if I remove the lowercase
filter, like so:

    <fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
        </analyzer>
    </fieldType>

I still need to add quotes to the search query to get results. The weird
thing is that if I use the analyzer and put in "FE 009" (again, without
quotes) for both index and query values, it highlights the result to
show a match, but when I search using the GUI it gives me no results. The
same happens when sending the query directly to the /select requestHandler
via GET.

This is what I send using GET:
http://mysite.com/solr/corename/select?q=number:FE%20009&qf=number    =>
this does not work
http://mysite.com/solr/corename/select?q=number:"FE%20009"&qf=number  =>
this works
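
For reference, here is a small sketch of how the two query forms that keep "FE 009" together could be URL-encoded from a client. It assumes the host and core name from the URLs above, and that the parser honors backslash escapes as Lucene query syntax does; `select_url` is a hypothetical helper, not part of any Solr client library.

```python
from urllib.parse import urlencode

# Host and core name as used in the message above.
BASE = "http://mysite.com/solr/corename/select"

def select_url(q):
    """Build a /select URL with the q parameter safely URL-encoded."""
    return BASE + "?" + urlencode({"q": q})

# Quoted phrase: the whole value reaches the field analyzer as one string.
print(select_url('number:"FE 009"'))
# http://mysite.com/solr/corename/select?q=number%3A%22FE+009%22

# Escaped space: keeps the value together without using quotes.
print(select_url(r'number:FE\ 009'))
# http://mysite.com/solr/corename/select?q=number%3AFE%5C+009
```

Encoding the parameter on the client side also avoids sending raw quote characters in the URL, as in the second request above.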

Really starting to wonder if I am doing something terribly wrong somewhere.

This is my requestHandler btw, pretty basic:
<!-- #### Default handler #### -->
    <requestHandler name="/select" class="solr.SearchHandler">
        <lst name="defaults">
            <str name="echoParams">explicit</str>
            <str name="defType">edismax</str>
            <str name="q.alt">*:*</str>
            <str name="rows">10</str>
            <str name="fl">*,score</str>
            <str name="qf">number</str>
        </lst>
    </requestHandler>

*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: aleksander@gurusoft.no

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no



Re: KeywordTokenizerFactory - trouble with "exact" matches

Posted by Aruna Kumar Pamulapati <ap...@gmail.com>.

Hi,

I think the misunderstanding you are having is about the lowercase factory:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory

You are correct about KeywordTokenizerFactory, but the lowercase factory
creates tokens by lowercasing all letters and dropping non-letters.

The best place to experiment with and learn these pipelines is the Solr admin
panel => analysis page.
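
Note that the description above matches LowerCaseTokenizerFactory, the entry the link points to, while the field type in the original post uses LowerCaseFilterFactory, which only lowercases the tokens it receives. The difference can be sketched in plain Python. The three functions below are simplified models written for illustration, not the Lucene implementations.

```python
import re

def keyword_tokenizer(text):
    """Model of KeywordTokenizerFactory: the whole input is a single token."""
    return [text]

def lowercase_filter(tokens):
    """Model of LowerCaseFilterFactory: lowercases each token, nothing else."""
    return [t.lower() for t in tokens]

def lowercase_tokenizer(text):
    """Model of LowerCaseTokenizerFactory: split at non-letters, then lowercase."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

# Keyword tokenizer followed by the lowercase filter keeps the value whole.
print(lowercase_filter(keyword_tokenizer("FE 009-1")))  # ['fe 009-1']

# The lowercase tokenizer drops the digits, the space and the hyphen.
print(lowercase_tokenizer("FE 009-1"))                  # ['fe']
```

Under this model the configuration in the original post still produces a single token per value, which is consistent with the later report that removing the filter made no difference.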


thanks,
Arun

