Posted to solr-user@lucene.apache.org by Vadim Gorlovetsky <va...@amdocs.com> on 2015/03/25 18:04:53 UTC

RE: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

Thanks for a quick response.

It is a bit confusing that an analyzer of type "query" configured with KeywordTokenizerFactory does not keep the query criteria as a single token.
I guess whitespace is the special case because it separates clauses in a query, and the query is parsed before analysis runs.

I am in fact already handling queries the way you recommend:
double quotes for exact matching, and escaped whitespace for values with wildcards (double quotes do not work there, probably because the "*" wildcard is then treated as part of the criteria value).

Thanks
Vadim

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Wednesday, March 25, 2015 6:34 PM
To: solr-user@lucene.apache.org
Subject: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

This is a _very_ common thing we all had to learn: what you're seeing is the result of the _query parser_, not the analysis chain. Anything like
proj_name_sort:term1 term2 gets split at the query parser level. Attaching &debug=query to the URL should show, down in the "parsed query" section, something like:

proj_name_sort:term1 default_search_field:term2

To get the whole value through the query parser, enclose it in double quotes or escape the space. That'll get the terms _as a single token_ to the analysis chain for that field, where the behavior will be what you expect.
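
The two options above (double quotes for an exact value, an escaped space for a wildcard pattern) could be sketched in client code roughly as follows. This is only an illustration: the helper names escape_term, exact_query and wildcard_query are hypothetical, and the set of escaped characters is an assumption based on the standard query parser's syntax characters, so check it against the parser you actually use.

```python
# Characters assumed to be syntax for the standard (Lucene) query parser.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_term(value, keep_wildcards=False):
    """Backslash-escape syntax characters and whitespace in a single term.

    With keep_wildcards=True, '*' and '?' are left alone so they still
    act as wildcards.
    """
    out = []
    for ch in value:
        if keep_wildcards and ch in ('*', '?'):
            out.append(ch)
        elif ch in SPECIAL_CHARS or ch.isspace():
            out.append('\\' + ch)
        else:
            out.append(ch)
    return ''.join(out)


def exact_query(field, value):
    """Exact match: wrap the value in double quotes (hypothetical helper)."""
    escaped = value.replace('\\', '\\\\').replace('"', '\\"')
    return f'{field}:"{escaped}"'


def wildcard_query(field, pattern):
    """Wildcard match: escape the spaces but keep '*' and '?' active."""
    return f'{field}:{escape_term(pattern, keep_wildcards=True)}'


if __name__ == "__main__":
    print(exact_query("proj_name_sort", "CR610070 Test1"))
    print(wildcard_query("proj_name_sort", "CR610070 Another Te*"))
```

Double quotes are not used for the wildcard case because, as noted elsewhere in this thread, the "*" inside quotes is treated as part of the value.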

Best,
Erick

On Wed, Mar 25, 2015 at 9:26 AM, Vadim Gorlovetsky <va...@amdocs.com> wrote:
> Hello,
>
> solr.KeywordTokenizerFactory seems to split on whitespace, though according to the Solr documentation it shouldn't.
>
>
> For example I have the following configuration for the fields "proj_name" and "proj_name_sort":
>
> <field name="proj_name" type="sortable_text_general" indexed="true" stored="true"/>
> <field name="proj_name_sort" type="string_sort" indexed="true" stored="false"/>
> ......
>
> <copyField source="proj_name" dest="proj_name_sort" /> 
> ..................
>
> <fieldType name="string_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>   <analyzer>
>     <!-- KeywordTokenizer does no actual tokenizing, so the entire
>          input string is preserved as a single token
>      -->
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <!-- The LowerCase TokenFilter does what you expect, which can be
>          useful when you want your sorting to be case insensitive
>       -->
>     <filter class="solr.LowerCaseFilterFactory" />
>     <!-- The TrimFilter removes any leading or trailing whitespace -->
>     <filter class="solr.TrimFilterFactory" />
>   </analyzer>
> </fieldType>
>
> There are 3 indexed documents with the following field values:
> proj_name:
> Test1008
> CR610070 Test1
> CR610070 Another Test2
>
> Searching on "proj_name_sort" gives me the following results:
>
> Query                                    | Expected               | Real                                             | Comments
> -----------------------------------------+------------------------+--------------------------------------------------+---------
> proj_name_sort : CR610070 Test1          | CR610070 Test1         | CR610070 Test1                                   | Expectable as seems searching exact un-tokenized value
> proj_name_sort : CR610070 Te             | None                   | None                                             | Expectable as seems searching exact un-tokenized value
> proj_name_sort : CR610070 Te*            | CR610070 Test1         | CR610070 Test1, Test1008, CR610070 Another Test2 | Seems splits on tokens by whitespace ?????
> proj_name_sort : CR610070 An*            | CR610070 Another Test2 | CR610070 Another Test2                           | Expectable as seems applying wild card on un-tokenized value
> proj_name_sort : CR610070 Another Te*    | CR610070 Another Test2 | CR610070 Test1, Test1008, CR610070 Another Test2 | Seems splits on tokens by whitespace ?????
> proj_name_sort : CR610070 Another Test1* | None                   | CR610070 Test1, Test1008, CR610070 Another Test2 | Seems splits on tokens by whitespace ?????
>
>
> Please advise on how to search un-tokenized fields using partial criteria and wildcards.
>
> Thanks
> Vadim
>
>
> This message and the information contained herein is proprietary and 
> confidential and subject to the Amdocs policy statement, you may review at http://www.amdocs.com/email_disclaimer.asp


Re: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

Posted by Erick Erickson <er...@gmail.com>.

Yeah, this is a head scratcher. But it _has_ to be that way for things
like edismax to work, where you mix and match fielded and un-fielded
terms. I.e. I can have a query like "q=field1:whatever some more
stuff&qf=field2,field3,field4" where I want "whatever" to be evaluated
only against field1, but the remaining terms to be searched for in the
three other fields.
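
To make the splitting concrete, here is a toy model of it. It is emphatically _not_ Solr's query parser, just a sketch assuming the only rules are "split on whitespace unless it is escaped or inside double quotes" and "a clause without a field prefix goes to the default field(s)". The function name split_clauses and the default_field label are made up for illustration.

```python
def split_clauses(q, default_field="default_search_field"):
    """Toy model: split a query string into (field, term) clauses.

    Whitespace separates clauses unless it is backslash-escaped or
    inside double quotes. Clauses without "field:" get default_field.
    """
    clauses, buf, in_quotes, i = [], [], False, 0
    while i < len(q):
        ch = q[i]
        if ch == '\\' and i + 1 < len(q):
            # Keep the escape and the escaped character together.
            buf.append(q[i:i + 2])
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        if ch.isspace() and not in_quotes:
            if buf:
                clauses.append(''.join(buf))
                buf = []
        else:
            buf.append(ch)
        i += 1
    if buf:
        clauses.append(''.join(buf))

    result = []
    for clause in clauses:
        field, sep, term = clause.partition(':')
        result.append((field, term) if sep else (default_field, clause))
    return result


if __name__ == "__main__":
    # Unescaped space: the second term no longer belongs to proj_name_sort.
    print(split_clauses("proj_name_sort:CR610070 Te*"))
    # Escaped space: one clause reaches the field's analysis chain.
    print(split_clauses("proj_name_sort:CR610070\\ Te*"))
```

In this model only the first term of an unescaped query stays attached to the named field, which is the same shape as the "parsed query" debug output quoted earlier in the thread.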

The deal is that how you want _individual terms_ handled at index time
may be different from how you want them handled at query time;
WordDelimiterFilterFactory and SynonymFilterFactory are prime examples
of this. Getting my head around why field analysis is completely
different from query _parsing_ took me a while. The fact that both are
called "query" is confusing, but I'm not sure what would be better since
they're very closely related; they both deal with queries, just at
different stages.

I missed the wildcards; you're right, you need to escape. Or use
the "prefix" query parser. It'd look like:
q={!prefix f=proj_name_sort}CR610070 An

No escaping is necessary. If you add &debug=query to a query using the
prefix parser, you'll see that there's an implied trailing *. Do be
aware, though, that there is _no_ analysis done, so things like
lowercasing would have to be done by the app.
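
Since the prefix parser skips analysis, the app has to mimic whatever the index-time chain did to the value. A minimal sketch, assuming the field uses the string_sort type from the original post (so indexed values are lowercased); the helper name prefix_query_params is hypothetical, and the parameters are only built here, not sent anywhere.

```python
from urllib.parse import urlencode


def prefix_query_params(field, prefix, lowercase=True):
    """Build request parameters for a {!prefix} query (hypothetical helper).

    The prefix parser does no analysis, so lowercasing is done here to
    match what LowerCaseFilterFactory did at index time.
    """
    value = prefix.lower() if lowercase else prefix
    return {
        "q": f"{{!prefix f={field}}}{value}",
        "debug": "query",  # shows the parsed query in the response
    }


if __name__ == "__main__":
    params = prefix_query_params("proj_name_sort", "CR610070 An")
    print(params["q"])
    # URL-encoded form, ready to append to a select URL.
    print(urlencode(params))
```

Pass lowercase=False for fields whose index-time chain does not lowercase.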

Neither one is "more correct"; in fact I believe that the wildcard
query becomes a prefix query eventually. It's strictly a matter of how
you want to deal with that in the app.

Best,
Erick
