Posted to solr-user@lucene.apache.org by Andrea Gazzarini <an...@gmail.com> on 2013/08/12 11:13:59 UTC

Tokenization at query time

Hi all,
I have a field (among others) in my schema defined like this:

<fieldtype name="mytype" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory" />
         <filter class="solr.LowerCaseFilterFactory" />
         <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="0"
             generateNumberParts="0"
             catenateWords="0"
             catenateNumbers="0"
             catenateAll="1"
             splitOnCaseChange="0" />
     </analyzer>
</fieldtype>

<field name="myfield" type="mytype" indexed="true"/>

Basically, both at index and query time the field value is normalized 
like this.

Mag. 778 G 69 => mag778g69
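As a rough illustration, the effect of this chain (KeywordTokenizer keeps the whole value as one token, LowerCaseFilter lowercases it, and WordDelimiterFilter with catenateAll="1" glues the alphanumeric parts back together) can be approximated like this. This is a sketch only, not real Lucene analysis code; the real WordDelimiterFilter has many more rules:

```java
public class NormalizeSketch {

    // Approximates lowercase + word-delimiter catenateAll applied to a
    // single keyword token: keep letters and digits only, lowercased.
    static String normalize(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Mag. 778 G 69")); // prints: mag778g69
    }
}
```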

Now, in my solrconfig I'm using a search handler like this:

<requestHandler ....>
     ...
     <str name="defType">dismax</str>
     ...
     <str name="mm">100%</str>
     <str name="qf">myfield^3000</str>
     <str name="pf">myfield^30000</str>

</requestHandler>

What I'm expecting is that if I index a document with a value for my 
field "Mag. 778 G 69", I will be able to get this document by querying

1. Mag. 778 G 69
2. mag 778 g69
3. mag778g69

But that doesn't work: I'm able to get the document if and only if I 
use the "normalized" form: mag778g69

After a little bit of debugging, I see that, even though I used a 
KeywordTokenizer in my field type declaration, Solr is doing something 
like this:
+((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1) 
DisjunctionMaxQuery((myfield:778^3000.0)~0.1) 
DisjunctionMaxQuery((myfield:g^3000.0)~0.1) 
DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4) 
DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)

That is, it is tokenizing the original query string (mag + 778 + g + 69), 
and obviously querying the field for the separate tokens doesn't match 
anything (at least this is what I think).

Could anybody please explain this to me?

Thanks in advance
Andrea

Re: Tokenization at query time

Posted by Andrea Gazzarini <an...@gmail.com>.
On 08/26/2013 04:09 PM, Erick Erickson wrote:
> right, edismax is much preferred, dismax hasn't been formally deprecated,
> but almost nobody uses it...
Good to know...I basically use dismax in ALL my SOLR instances :D
> I'd be really careful about adding whitespace to the list of escape chars
> because it changes the semantics of the search. While it'll work for this
> specific case, if you use it in other cases it will change the sense of the
> query. This may be OK, but be careful, it might be better to do this
> specifically on an as-needed basis...
Yes, that's the reason why I'm not really sure about what I did...I'm 
running my regression tests...all seems green...let's see
> But you know your problem space best
>
> Best,
> Erick
Thank you very much

Best,
Gazza
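The double-escaping described in the quoted exchange below (ClientUtils.escapeQueryChars on the client, then dismax's partialEscape on the server) can be reproduced with simplified stand-ins. These are sketches of the interaction, not the real Solr classes; in particular, the real partialEscape handles more characters than just backslashes:

```java
public class EscapeDemo {

    // Simplified version of Solr 3.6's ClientUtils.escapeQueryChars:
    // prefixes query-syntax characters and whitespace with a backslash.
    static String escapeQueryChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if ("\\+-!():^[]\"{}~*?|&;".indexOf(c) >= 0
                    || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Simplified stand-in for what dismax's partialEscape does to the
    // already-escaped string: it escapes the backslashes again, which
    // defeats the client-side whitespace escaping.
    static String partialEscape(String s) {
        return s.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        String q = escapeQueryChars("978 90 04 23560 1");
        System.out.println(q);                // prints: 978\ 90\ 04\ 23560\ 1
        System.out.println(partialEscape(q)); // prints: 978\\ 90\\ 04\\ 23560\\ 1
    }
}
```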
>
>
> On Mon, Aug 26, 2013 at 9:04 AM, Andrea Gazzarini <
> andrea.gazzarini@gmail.com> wrote:
>
>> Hi Erick,
>> sorry, I forgot the Solr version...it's 3.6.0
>>
>> ClientUtils in that version does whitespace escaping:
>>
>>    public static String escapeQueryChars(String s) {
>>      StringBuilder sb = new StringBuilder();
>>      for (int i = 0; i < s.length(); i++) {
>>        char c = s.charAt(i);
>>        // These characters are part of the query syntax and must be escaped
>>        if (c == '\\' || c == '+' || c == '-' || c == '!'  || c == '(' || c
>> == ')' || c == ':'
>>          || c == '^' || c == '[' || c == ']' || c == '\"' || c == '{' || c
>> == '}' || c == '~'
>>          || c == '*' || c == '?' || c == '|' || c == '&'  || c == ';'
>>          || Character.isWhitespace(c)) {
>>          sb.append('\\');
>>        }
>>        sb.append(c);
>>      }
>>      return sb.toString();
>>    }
>>
>> Now, I solved the issue, but I'm not really sure about the fix.
>>
>> Debugging the code I saw that the query string (on the SearchHandler)
>>
>>
>> 978\ 90\ 04\ 23560\ 1
>>
>> once passed through DismaxQueryParser (specifically through
>> SolrPluginUtils.partialEscape(CharSequence))
>>
>> becomes
>>
>> 978\\ 90\\ 04\\ 23560\\ 1
>>
>> because that method escapes the backslashes
>>
>> So, using the eclipse debugger I removed at runtime the additional
>> backslash and it works perfectly but of course...I can't do that in
>> production for every search :D
>>
>> So, just to try, I changed dismax to edismax which, I saw, doesn't call
>> SolrPluginUtils...and it works perfectly!
>>
>> I saw in your query string that you used edismax too...maybe that's the
>> point?
>>
>> Many thanks
>> Andrea
>>
>>
>> On 08/26/2013 02:47 PM, Erick Erickson wrote:
>>
>>> Andrea:
>>>
>>> Works for me, admittedly through the browser....
>>>
>>> I suspect the problem is here: ClientUtils.escapeQueryChars
>>>
>>>
>>> That doesn't do anything about escaping the spaces, it just handles
>>> characters that have special meaning to the query syntax, things like +-
>>> etc.
>>>
>>> Using your field definition, this:
>>> http://localhost:8983/solr/select?wt=json&q=ab\ cd\ ef&debug=query&defType=edismax&qf=name eoe
>>> produced this output..
>>>
>>>      - parsedquery_toString: "+(eoe:abcdef | (name:ab name:cd name:ef))",
>>>
>>>
>>>
>>> where the field "eoe" is your isbn_issn type.
>>>
>>> Best,
>>> Erick
>>>
>>>
>>> On Mon, Aug 26, 2013 at 4:55 AM, Andrea Gazzarini <
>>> andrea.gazzarini@gmail.com> wrote:
>>>
>>>   Hi Erick,
>>>> escaping spaces doesn't work...
>>>>
>>>> Briefly,
>>>>
>>>> - In a document I have an ISBN field that (stored value) is
>>>> *978-90-04-23560-1*
>>>> - In the index I have this value: *9789004235601*
>>>>
>>>> Now, I want to be able to search the document by using:
>>>>
>>>> 1) q=*978-90-04-23560-1*
>>>> 2) q=*978 90 04 23560 1*
>>>> 3) q=*9789004235601*
>>>>
>>>> 1 and 3 work perfectly, 2 doesn't.
>>>>
>>>> My code is:
>>>>
>>>> SolrQuery query = new SolrQuery(ClientUtils.escapeQueryChars(req.getParameter("q")));
>>>>
>>>> isbn is declared in this way
>>>>
>>>> <fieldtype name="isbn_issn" class="solr.TextField"
>>>> positionIncrementGap="100">
>>>>       <analyzer>
>>>>           <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>           <filter class="solr.LowerCaseFilterFactory"/>
>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>> generateWordParts="0" generateNumberParts="0" catenateWords="0"
>>>> catenateNumbers="0" catenateAll="1" splitOnCaseChange="0"/>
>>>>       </analyzer>
>>>> </fieldtype>
>>>> <field name="isbn_issn_search" type="isbn_issn" indexed="true"/>
>>>> search handler is:
>>>>
>>>>       <requestHandler name="any_bc" class="solr.SearchHandler"
>>>> default="true">
>>>>           <lst name="defaults">
>>>>               <str name="defType">dismax</str>
>>>>
>>>>               <str name="mm">100%</str>
>>>>               <str name="qf">
>>>>                   isbn_issn_search^10000
>>>>               </str>
>>>>               <str name="pf">
>>>>                   isbn_issn_search^100000
>>>>               </str>
>>>>               <int name="ps">0</int>
>>>>               <float name="tie">0.1</float>
>>>>               ...
>>>>       </requestHandler>
>>>>
>>>> This is what I get:
>>>>
>>>> 1) 978-90-04-23560-1
>>>> path=/select params={start=0&q=978\-90\-04\-23560\-1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=1 status=0 QTime=5
>>>>
>>>> 2) 9789004235601
>>>> webapp=/solr path=/select params={start=0&q=9789004235601&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=1 status=0 QTime=5
>>>>
>>>> 3) 978 90 04 23560 1
>>>> path=/select params={start=0&q=978\+90\+04\+23560\+1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=0 status=0 QTime=2
>>>>
>>>>
>>>> Extract from debugQuery=true:
>>>>
>>>> <str name="q">978\ 90\ 04\ 23560\ 1</str>
>>>> ...
>>>> <str name="rawquerystring">978\ 90\ 04\ 23560\ 1</str>
>>>> <str name="querystring">978\ 90\ 04\ 23560\ 1</str>
>>>> ...
>>>> <str name="parsedquery">
>>>>       +((DisjunctionMaxQuery((isbn_issn_search:978^10000.0)~0.1)
>>>>       DisjunctionMaxQuery((isbn_issn_search:90^10000.0)~0.1)
>>>>       DisjunctionMaxQuery((isbn_issn_search:04^10000.0)~0.1)
>>>>       DisjunctionMaxQuery((isbn_issn_search:23560^10000.0)~0.1)
>>>>       DisjunctionMaxQuery((isbn_issn_search:1^10000.0)~0.1))~5)
>>>>       DisjunctionMaxQuery((isbn_issn_search:9789004235601^100000.0)~0.1)
>>>> </str>
>>>>
>>>> ------------------------------------------------
>>>>
>>>> Probably this is a very stupid question but I'm going crazy. In this page
>>>>
>>>> http://wiki.apache.org/solr/DisMaxQParserPlugin
>>>> Query Structure
>>>>
>>>> "For each 'word' in the query string, dismax builds a DisjunctionMaxQuery
>>>> object for that word across all of the fields in the qf param..."
>>>>
>>>> And that seems exactly like what it is doing...but what is a "word"? How
>>>> can I force (without using double quotes) spaces to be considered part of
>>>> the word?
>>>>
>>>> Many many many thanks
>>>> Andrea
>>>>
>>>>
>>>> On 08/13/2013 04:18 PM, Erick Erickson wrote:
>>>>
>>>>   I think you can get what you want by escaping the space with a
>>>>> backslash....
>>>>>
>>>>> YMMV of course.
>>>>> Erick
>>>>>
>>>>>
>>>>> On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <
>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>
>>>>>    Hi Erick,
>>>>>
>>>>>> sorry if that wasn't clear: this is what I'm actually observing in my
>>>>>> application.
>>>>>>
>>>>>> I wrote the first post after looking at the explain (debugQuery=true):
>>>>>> the
>>>>>> query
>>>>>>
>>>>>> q=mag 778 G 69
>>>>>>
>>>>>> is translated as follows:
>>>>>>
>>>>>>
>>>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>>         DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>>         DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>>         DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>>         DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>>
>>>>>> It seems that although I declared myfield with this type
>>>>>>
>>>>>> <fieldtype name="type1" class="solr.TextField" >
>>>>>>        <analyzer>
>>>>>>            <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>            <filter class="solr.LowerCaseFilterFactory" />
>>>>>>            <filter class="solr.WordDelimiterFilterFactory"
>>>>>>                generateWordParts="0" generateNumberParts="0"
>>>>>>                catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>>>>                splitOnCaseChange="0" />
>>>>>>        </analyzer>
>>>>>> </fieldtype>
>>>>>>
>>>>>> Solr is therefore tokenizing it, producing several tokens
>>>>>> (mag, 778, g, 69).
>>>>>>
>>>>>> And I can't put double quotes on the query (q="mag 778 G 69") because
>>>>>> the
>>>>>> request handler searches also in other fields (with different
>>>>>> configuration
>>>>>> chains)
>>>>>>
>>>>>> As I understand it, the query parser (i.e. at query time) does a
>>>>>> whitespace tokenization on its own before invoking my (query-time)
>>>>>> chain. The same doesn't happen at index time...this is my
>>>>>> problem...because at index time the field is analyzed exactly as I
>>>>>> want, but unfortunately I cannot say the same at query time.
>>>>>>
>>>>>> Sorry for my wonderful english, did you get the point?
>>>>>>
>>>>>>
>>>>>> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>>>>>>
>>>>>>    On a quick scan I don't see a problem here. Attach
>>>>>>
>>>>>>> &debug=query to your url and that'll show you the
>>>>>>> parsed query, which will in turn show you what's been
>>>>>>> pushed through the analysis chain you've defined.
>>>>>>>
>>>>>>> You haven't stated whether you've tried this and it's
>>>>>>> not working or you're looking for guidance as to how
>>>>>>> to accomplish this so it's a little unclear how to
>>>>>>> respond.
>>>>>>>
>>>>>>> BTW, the admin/analysis page is your friend here....
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
>>>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>>>
>>>>>>>     Clear, thanks for response.
>>>>>>>
>>>>>>>   So, if I have two fields
>>>>>>>> <fieldtype name="type1" class="solr.TextField" >
>>>>>>>>         <analyzer>
>>>>>>>>             <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>>>             <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>>             <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>                 generateWordParts="0" generateNumberParts="0"
>>>>>>>>                 catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>>>>>>                 splitOnCaseChange="0" />
>>>>>>>>         </analyzer>
>>>>>>>> </fieldtype>
>>>>>>>> <fieldtype name="type2" class="solr.TextField" >
>>>>>>>>         <analyzer>
>>>>>>>>             <charFilter class="solr.MappingCharFilterFactory"
>>>>>>>>                 mapping="mapping-FoldToASCII.txt"/>
>>>>>>>>             <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>>>>>>             <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>>             <filter class="solr.WordDelimiterFilterFactory" .../>
>>>>>>>>         </analyzer>
>>>>>>>> </fieldtype>
>>>>>>>>
>>>>>>>> (with the first field type, Mag. 78 D 99 becomes mag78d99, while the
>>>>>>>> second field type ends up with several tokens)
>>>>>>>>
>>>>>>>> And I want to use the same request handler to query against both of
>>>>>>>> them.
>>>>>>>> I mean, I want the user to be able to search something like
>>>>>>>>
>>>>>>>> http://..../search?q=Mag 78 D 99
>>>>>>>>
>>>>>>>> and this search should search within both the first (with type1) and
>>>>>>>> the second (with type2), matching
>>>>>>>>
>>>>>>>> - a document which has field_with_type1 equal to mag78d99, or
>>>>>>>> - a document which has field_with_type2 containing text like "go to
>>>>>>>> mag 78, class d and subclass 99"
>>>>>>>>
>>>>>>>>
>>>>>>>> <requestHandler ....>
>>>>>>>>         ...
>>>>>>>>         <str name="defType">dismax</str>
>>>>>>>>         ...
>>>>>>>>         <str name="mm">100%</str>
>>>>>>>>         <str name="qf">
>>>>>>>>             field_with_type1
>>>>>>>>             field_with_type_2
>>>>>>>>         </str>
>>>>>>>>         ...
>>>>>>>> </requestHandler>
>>>>>>>>
>>>>>>>> Is this not possible? If not, is it possible to do that in some
>>>>>>>> other way?
>>>>>>>>
>>>>>>>> Sorry for the long email and thanks again
>>>>>>>> Andrea
>>>>>>>>
>>>>>>>>
>>>>>>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>>>>>>
>>>>>>>>> Quoted phrases will be passed to the analyzer as one string, so
>>>>>>>>> that's where a whitespace tokenizer is needed.
>>>>>>>>> -- Jack Krupansky
>>>>>>>>>
>>>>>>>>> -----Original Message----- From: Andrea Gazzarini
>>>>>>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>> Subject: Re: Tokenization at query time
>>>>>>>>>
>>>>>>>>> Hi Tanguy,
>>>>>>>>> thanks for the fast response. What you are saying corresponds
>>>>>>>>> perfectly with the behaviour I'm observing.
>>>>>>>>> Now, other than having a big problem (I have several other fields,
>>>>>>>>> both in pf and qf, where spaces don't matter - field types like the
>>>>>>>>> "text_en" field type in the example schema), what I'm wondering is:
>>>>>>>>>
>>>>>>>>> "The query parser splits the input query on white spaces, and then
>>>>>>>>> each token is analysed according to your configuration"
>>>>>>>>>
>>>>>>>>> Is there a valid reason to declare a WhitespaceTokenizer in a query
>>>>>>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>>>>>>> tokenized), what is its effect?
>>>>>>>>>
>>>>>>>>> Thank you very much for the help
>>>>>>>>> Andrea
>>>>>>>>>
>>>>>>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>>>>>>
>>>>>>>>>     Hello Andrea,
>>>>>>>>>
>>>>>>>>>   I think you face a rather common issue involving keyword
>>>>>>>>>> tokenization
>>>>>>>>>> and query parsing in Lucene:
>>>>>>>>>> The query parser splits the input query on white spaces, and then
>>>>>>>>>> each
>>>>>>>>>> token is analysed according to your configuration.
>>>>>>>>>> So those queries with a whitespace won't behave as expected because
>>>>>>>>>> each
>>>>>>>>>> token is analysed separately. Consequently, the catenated version
>>>>>>>>>> of
>>>>>>>>>> the
>>>>>>>>>> reference cannot be generated.
>>>>>>>>>> I think you could try surrounding your query with double quotes or
>>>>>>>>>> escaping the space characters in your query using a backslash so
>>>>>>>>>> that
>>>>>>>>>> the
>>>>>>>>>> whole sequence is analysed in the same analyser and the catenation
>>>>>>>>>> occurs.
>>>>>>>>>> You should be aware that this approach has a drawback: you will
>>>>>>>>>> probably
>>>>>>>>>> not be able to combine the search for Mag. 778 G 69 with other
>>>>>>>>>> words
>>>>>>>>>> in
>>>>>>>>>> other fields unless you are able to identify which spaces are to be
>>>>>>>>>> escaped:
>>>>>>>>>> For example, if the input query is:
>>>>>>>>>> Awesome Mag. 778 G 69
>>>>>>>>>> you would want to transform it to:
>>>>>>>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>>>>>>>>> or
>>>>>>>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a
>>>>>>>>>> phrase query
>>>>>>>>>>
>>>>>>>>>> Do you get the point?
>>>>>>>>>>
>>>>>>>>>> Look at the differences between what you tried and the following
>>>>>>>>>> examples which should all do what you want:
>>>>>>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>>>>>>> OR
>>>>>>>>>>
>>>>>>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>>>>>>> OR
>>>>>>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I hope this helps
>>>>>>>>>>
>>>>>>>>>> Tanguy
>>>>>>>>>>
>>>>>>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>>>>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>>>>>>


Re: Tokenization at query time

Posted by Erick Erickson <er...@gmail.com>.
right, edismax is much preferred, dismax hasn't been formally deprecated,
but almost nobody uses it...

I'd be really careful about adding whitespace to the list of escape chars
because it changes the semantics of the search. While it'll work for this
specific case, if you use it in other cases it will change the sense of the
query. This may be OK, but be careful, it might be better to do this
specifically on an as-needed basis...

But you know your problem space best

Best,
Erick
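One way to apply the escaping "specifically on an as-needed basis" is to escape whitespace only inside substrings that look like the spaced reference. The sketch below is hypothetical: the pattern (three or more whitespace-separated chunks of at most six characters) is invented for illustration and would have to be adapted to the real reference format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReferenceEscaper {

    // Illustrative guess at a "reference" shape: three or more short
    // whitespace-separated chunks, e.g. "Mag. 778 G 69" or "978 90 04 23560 1"
    private static final Pattern REF =
        Pattern.compile("\\b\\S{1,6}(?: \\S{1,6}){2,}\\b");

    static String escapeReferences(String query) {
        Matcher m = REF.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // backslash-escape the spaces inside the matched reference only
            String escaped = m.group().replace(" ", "\\ ");
            m.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // "Awesome" (7 chars) stays outside the match, so only the
        // reference's spaces are escaped
        System.out.println(escapeReferences("Awesome Mag. 778 G 69"));
        // prints: Awesome Mag.\ 778\ G\ 69
    }
}
```

This keeps the rest of the query searchable as ordinary words, which addresses Erick's concern that blanket whitespace escaping changes the semantics of the whole search.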


On Mon, Aug 26, 2013 at 9:04 AM, Andrea Gazzarini <
andrea.gazzarini@gmail.com> wrote:

> Hi Erick,
> sorry I forgot the SOLR version...is the 3.6.0
>
> ClientUtils in that version does whitespace escaping:
>
>   public static String escapeQueryChars(String s) {
>     StringBuilder sb = new StringBuilder();
>     for (int i = 0; i < s.length(); i++) {
>       char c = s.charAt(i);
>       // These characters are part of the query syntax and must be escaped
>       if (c == '\\' || c == '+' || c == '-' || c == '!'  || c == '(' || c
> == ')' || c == ':'
>         || c == '^' || c == '[' || c == ']' || c == '\"' || c == '{' || c
> == '}' || c == '~'
>         || c == '*' || c == '?' || c == '|' || c == '&'  || c == ';'
>         || Character.isWhitespace(c)) {
>         sb.append('\\');
>       }
>       sb.append(c);
>     }
>     return sb.toString();
>   }
>
> Now, I solved the issue but not really sure about that.
>
> Debugging the code I saw that the query string (on the SearchHandler)
>
>
> 978\ 90\ 04\ 23560\ 1
>
> once passed through DismaxQueryParser (specifically through
> SolrPluginUtils.partialEscape(**CharSequence)
>
> becames
>
>
> 978\\ 90\\ 04\\ 23560\\ 1
>
> because that method escapes the backslashes
>
> So, using the eclipse debugger I removed at runtime the additional
> backslash and it works perfectly but of course...I can't do that in
> production for every search :D
>
> So, just to try I changed dismax in edismax which, I saw, doesn't call
> SolrPluginUtils....and it works perfectly!
>
> I saw in your query string that you used edismax too...maybe is that the
> point?
>
> Many thanks
> Andrea
>
>
> On 08/26/2013 02:47 PM, Erick Erickson wrote:
>
>> Andrea:
>>
>> Works for me, admittedly through the browser....
>>
>> I suspect the problem is here: ClientUtils.**escapeQueryChars
>>
>>
>> That doesn't do anything about escaping the spaces, it just handles
>> characters that have special meaning to the query syntax, things like +-
>> etc.
>>
>> Using your field definition, this:
>> http://localhost:8983/solr/**select?wt=json&q=ab\<http://localhost:8983/solr/select?wt=json&q=ab%5C>cd\
>> ef&debug=query&defType=**edismax&qf=name eoe
>> produced this output..
>>
>>     - parsedquery_toString: "+(eoe:abcdef | (name:ab name:cd name:ef))",
>>
>>
>>
>> where the field "eoe" is your isbn_issn type.
>>
>> Best,
>> Erick
>>
>>
>> On Mon, Aug 26, 2013 at 4:55 AM, Andrea Gazzarini <
>> andrea.gazzarini@gmail.com> wrote:
>>
>>  Hi Erick,
>>> escaping spaces doesn't work...
>>>
>>> Briefly,
>>>
>>> - In a document I have an ISBN field that (stored value) is
>>> *978-90-04-23560-1*
>>> - In the index I have this value: *9789004235601*
>>>
>>> Now, I want be able to search the document by using:
>>>
>>> 1) q=*978-90-04-23560-1*
>>> 2) q=*978 90 04 23560 1*
>>> 3) q=*9789004235601*
>>>
>>> 1 and 3 works perfectly, 2 doesn't work.
>>>
>>> My code is:
>>>
>>> /SolrQuery query = new SolrQuery(ClientUtils.****escapeQueryChars(req.**
>>>
>>> getParameter("q")));/
>>>
>>> isbn is declared in this way
>>>
>>> <fieldtype name="isbn_issn" class="solr.TextField"
>>> positionIncrementGap="100">
>>>      <analyzer>
>>>          <tokenizer class="*solr.****KeywordTokenizerFactory*"/>
>>>
>>>
>>>          <filter class="solr.****LowerCaseFilterFactory"/>
>>>          <filter class="solr.****WordDelimiterFilterFactory"
>>> generateWordParts="0" generateNumberParts="0" catenateWords="0"
>>> catenateNumbers="0" catenateAll="1" splitOnCaseChange="0"/>
>>>      </analyzer>
>>> </fieldtype>
>>> <field name="isbn_issn_search" type="issn_isbn" indexed="true"/>
>>> search handler is:
>>>
>>>      <requestHandler name="any_bc" class="solr.SearchHandler"
>>> default="true">
>>>          <lst name="defaults">
>>>              <str name="defType">*dismax*</str>
>>>
>>>              <str name="mm">100%</str>
>>>              <str name="qf">
>>> *isbn_issn_search*^10000
>>>              </str>
>>>              <str name="pf">
>>> *isbn_issn_search*^100000
>>>              </str>
>>>              <int name="ps">0</int>
>>>              <float name="tie">0.1</float>
>>>              ...
>>>      </requestHandler>
>>>
>>> This is what I get:
>>>
>>> *1) 978-90-04-23560-1**
>>> *path=/select params={start=0&q=*978\-90\-****
>>> 04\-23560\-1*&sfield=&qt=any_*
>>> *bc&wt=javabin&rows=10&**version=**2} *hits=1* status=0 QTime=5*
>>>
>>> 2) ***9789004235601*
>>> *webapp=/solr path=/select params={start=0&q=***
>>> 9789004235601*&sfield=&qt=any_****bc&wt=javabin&rows=10&**version=**2}
>>>
>>> *hits=1* status=0 QTime=5*
>>>
>>> 3) **978 90 04 23560 1**
>>> *path=/select params={start=0&*q=978\+90\+****
>>> 04\+23560\+1*&sfield=&qt=any_*
>>> *bc&wt=javabin&rows=10&**version=**2} *hits=0 *status=0 QTime=2*
>>>
>>>
>>> *Extract from queryDebug=true:
>>>
>>> <str name="q">978\ 90\ 04\ 23560\ 1</str>
>>> ...
>>> <str name="rawquerystring">978\ 90\ 04\ 23560\ 1</str>
>>> <str name="querystring">978\ 90\ 04\ 23560\ 1</str>
>>> ...
>>> <str name="parsedquery">
>>>      +((DisjunctionMaxQuery((isbn_****issn_search:*978*^10000.0)~0.**
>>> **1)
>>>      DisjunctionMaxQuery((isbn_****issn_search:*90*^10000.0)~0.1)
>>>      DisjunctionMaxQuery((isbn_****issn_search:*04*^10000.0)~0.1)
>>>      DisjunctionMaxQuery((isbn_****issn_search:*23560*^10000.0)~****0.1)
>>>      DisjunctionMaxQuery((isbn_****issn_search:*1*^10000.0)~0.1))****~5)
>>>      DisjunctionMaxQuery((isbn_****issn_search:*9789004235601*^**
>>> 100000.0)~0.1)
>>> </str>
>>>
>>> ------------------------------****------------------
>>>
>>> Probably this is a very stupid question but I'm going crazy. In this page
>>>
>>> http://wiki.apache.org/solr/****DisMaxQParserPlugin<http://wiki.apache.org/solr/**DisMaxQParserPlugin>
>>> <http://**wiki.apache.org/solr/**DisMaxQParserPlugin<http://wiki.apache.org/solr/DisMaxQParserPlugin>
>>> >
>>>
>>>
>>> *Query Structure*
>>>
>>> /For each "word" in the query string, dismax builds a DisjunctionMaxQuery
>>> object for that word across all of the fields in the //qf//param...
>>>
>>> /And seems exactly what it is doing...but what is a "word"? How can I
>>> force//(without using double quotes) spaces in a way that they are
>>> considered part of the word/?
>>>
>>> /Many many many thanks
>>> Andrea
>>>
>>>
>>> On 08/13/2013 04:18 PM, Erick Erickson wrote:
>>>
>>>  I think you can get what you want by escaping the space with a
>>>> backslash....
>>>>
>>>> YMMV of course.
>>>> Erick
>>>>
>>>>
>>>> On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <
>>>> andrea.gazzarini@gmail.com> wrote:
>>>>
>>>>   Hi Erick,
>>>>
>>>>> sorry if that wasn't clear: this is what I'm actually observing in my
>>>>> application.
>>>>>
>>>>> I wrote the first post after looking at the explain (debugQuery=true):
>>>>> the
>>>>> query
>>>>>
>>>>> q=mag 778 G 69
>>>>>
>>>>> is translated as follow:
>>>>>
>>>>>
>>>>>   +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>         DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>         DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>         DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>         DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>
>>>>> It seems that although I declare myfield with this type
>>>>>
>>>>> <fieldtype name="type1" class="solr.TextField">
>>>>>
>>>>>       <analyzer>
>>>>>           <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>>               generateWordParts="0" generateNumberParts="0"
>>>>>               catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>>>               splitOnCaseChange="0" />
>>>>>       </analyzer>
>>>>> </fieldtype>
>>>>>
>>>>> Solr is therefore tokenizing it, producing several tokens
>>>>> (mag, 778, g, 69).
>>>>>
>>>>> And I can't put double quotes on the query (q="mag 778 G 69") because
>>>>> the
>>>>> request handler searches also in other fields (with different
>>>>> configuration
>>>>> chains)
>>>>>
>>>>> As I understand it, the query parser does a whitespace
>>>>> tokenization on its own before invoking my query-time chain. The same
>>>>> doesn't happen at index time... this is my problem, because at index
>>>>> time the field is analyzed exactly as I want, but unfortunately I
>>>>> cannot say the same at query time.
>>>>>
>>>>> Sorry for my English, did you get the point?
>>>>>
>>>>>
>>>>> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>>>>>
>>>>>   On a quick scan I don't see a problem here. Attach
>>>>>
>>>>>> &debug=query to your url and that'll show you the
>>>>>> parsed query, which will in turn show you what's been
>>>>>> pushed through the analysis chain you've defined.
>>>>>>
>>>>>> You haven't stated whether you've tried this and it's
>>>>>> not working or you're looking for guidance as to how
>>>>>> to accomplish this so it's a little unclear how to
>>>>>> respond.
>>>>>>
>>>>>> BTW, the admin/analysis page is your friend here....
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
>>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>>
>>>>>>    Clear, thanks for response.
>>>>>>
>>>>>>  So, if I have two fields
>>>>>>>
>>>>>>> <fieldtype name="type1" class="solr.TextField">
>>>>>>>        <analyzer>
>>>>>>>            <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>>            <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>            <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>                generateWordParts="0" generateNumberParts="0"
>>>>>>>                catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>>>>>                splitOnCaseChange="0" />
>>>>>>>        </analyzer>
>>>>>>> </fieldtype>
>>>>>>> <fieldtype name="type2" class="solr.TextField">
>>>>>>>        <analyzer>
>>>>>>>            <charFilter class="solr.MappingCharFilterFactory"
>>>>>>>                mapping="mapping-FoldToASCII.txt"/>
>>>>>>>            <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>>>>>            <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>            <filter class="solr.WordDelimiterFilterFactory" .../>
>>>>>>>        </analyzer>
>>>>>>> </fieldtype>
>>>>>>>
>>>>>>> (with the first field type, *Mag. 78 D 99* becomes *mag78d99*, while
>>>>>>> the second field type ends up with several tokens)
>>>>>>>
>>>>>>> And I want to use the same request handler to query against both of
>>>>>>> them.
>>>>>>> I mean, I want the user to search for something like
>>>>>>>
>>>>>>> http://..../search?q=Mag 78 D 99
>>>>>>>
>>>>>>> and this search should search within both the first (with type1) and
>>>>>>> second (with type 2) by matching
>>>>>>>
>>>>>>> - a document which has field_with_type1 equals to *mag78d99* or
>>>>>>> - a document which has field_with_type2 that contains a text like "go
>>>>>>> to *mag 78*, class *d* and subclass *99*")
>>>>>>>
>>>>>>>
>>>>>>> <requestHandler ....>
>>>>>>>        ...
>>>>>>>        <str name="defType">dismax</str>
>>>>>>>        ...
>>>>>>>        <str name="mm">100%</str>
>>>>>>>        <str name="qf">
>>>>>>>            field_with_type1
>>>>>>>            field_with_type_2
>>>>>>>        </str>
>>>>>>>        ...
>>>>>>> </requestHandler>
>>>>>>>
>>>>>>> Is this not possible? If so, is it possible to do it in some other way?
>>>>>>>
>>>>>>> Sorry for the long email and thanks again
>>>>>>> Andrea
>>>>>>>
>>>>>>>
>>>>>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>>>>>
>>>>>>>> Quoted phrases will be passed to the analyzer as one string, so there
>>>>>>>> a whitespace tokenizer is needed.
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>> -----Original Message----- From: Andrea Gazzarini
>>>>>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Subject: Re: Tokenization at query time
>>>>>>>>
>>>>>>>> Hi Tanguy,
>>>>>>>> thanks for fast response. What you are saying corresponds perfectly
>>>>>>>> with
>>>>>>>> the behaviour I'm observing.
>>>>>>>> Now, other than having a big problem (I have several other fields
>>>>>>>> both
>>>>>>>> in the pf and qf where spaces don't matter, field types like the
>>>>>>>> "text_en" field type in the example schema) what I'm wondering is:
>>>>>>>>
>>>>>>>> "The query parser splits the input query on white spaces, and then
>>>>>>>> each token is analysed according to your configuration"
>>>>>>>> Is there a valid reason to declare a WhiteSpaceTokenizer in a query
>>>>>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>>>>>> tokenized) what is its effect?
>>>>>>>>
>>>>>>>> Thank you very much for the help
>>>>>>>> Andrea
>>>>>>>>
>>>>>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>>>>>
>>>>>>>>    Hello Andrea,
>>>>>>>>
>>>>>>>>  I think you face a rather common issue involving keyword
>>>>>>>>> tokenization
>>>>>>>>> and query parsing in Lucene:
>>>>>>>>> The query parser splits the input query on white spaces, and then
>>>>>>>>> each
>>>>>>>>> token is analysed according to your configuration.
>>>>>>>>> So those queries with a whitespace won't behave as expected because
>>>>>>>>> each
>>>>>>>>> token is analysed separately. Consequently, the catenated version
>>>>>>>>> of
>>>>>>>>> the
>>>>>>>>> reference cannot be generated.
>>>>>>>>> I think you could try surrounding your query with double quotes or
>>>>>>>>> escaping the space characters in your query using a backslash so
>>>>>>>>> that
>>>>>>>>> the
>>>>>>>>> whole sequence is analysed in the same analyser and the catenation
>>>>>>>>> occurs.
>>>>>>>>> You should be aware that this approach has a drawback: you will
>>>>>>>>> probably
>>>>>>>>> not be able to combine the search for Mag. 778 G 69 with other
>>>>>>>>> words
>>>>>>>>> in
>>>>>>>>> other fields unless you are able to identify which spaces are to be
>>>>>>>>> escaped:
>>>>>>>>> For example, if the input query is:
>>>>>>>>> Awesome Mag. 778 G 69
>>>>>>>>> you would want to transform it to:
>>>>>>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference
>>>>>>>>> only
>>>>>>>>> or
>>>>>>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a
>>>>>>>>> phrase
>>>>>>>>> query
>>>>>>>>>
>>>>>>>>> Do you get the point?
>>>>>>>>>
>>>>>>>>> Look at the differences between what you tried and the following
>>>>>>>>> examples which should all do what you want:
>>>>>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>>>>>> OR
>>>>>>>>>
>>>>>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>>>>>> OR
>>>>>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I hope this helps
>>>>>>>>>
>>>>>>>>> Tanguy
>>>>>>>>>
>>>>>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>>>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>     Hi all,
>>>>>>>>>
>>>>>>>>>   I have a field (among others)in my schema defined like this:
>>>>>>>>>
>>>>>>>>>> <fieldtype name="mytype" class="solr.TextField"
>>>>>>>>>> positionIncrementGap="100">
>>>>>>>>>>        <analyzer>
>>>>>>>>>>            <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>>>>>            <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>>>>            <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>>>                generateWordParts="0"
>>>>>>>>>>                generateNumberParts="0"
>>>>>>>>>>                catenateWords="0"
>>>>>>>>>>                catenateNumbers="0"
>>>>>>>>>>                catenateAll="1"
>>>>>>>>>>                splitOnCaseChange="0" />
>>>>>>>>>>        </analyzer>
>>>>>>>>>> </fieldtype>
>>>>>>>>>>
>>>>>>>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>>>>>>>
>>>>>>>>>> Basically, both at index and query time the field value is
>>>>>>>>>> normalized
>>>>>>>>>> like this.
>>>>>>>>>>
>>>>>>>>>> Mag. 778 G 69 => mag778g69
>>>>>>>>>>
>>>>>>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>>>>>>
>>>>>>>>>> <requestHandler ....>
>>>>>>>>>>        ...
>>>>>>>>>>        <str name="defType">dismax</str>
>>>>>>>>>>        ...
>>>>>>>>>>        <str name="mm">100%</str>
>>>>>>>>>>        <str name="qf">myfield^3000</str>
>>>>>>>>>>        <str name="pf">myfield^30000</str>
>>>>>>>>>>
>>>>>>>>>> </requestHandler>
>>>>>>>>>>
>>>>>>>>>> What I'm expecting is that if I index a document with a value for
>>>>>>>>>> my
>>>>>>>>>> field "Mag. 778 G 69", I will be able to get this document by
>>>>>>>>>> querying
>>>>>>>>>>
>>>>>>>>>> 1. Mag. 778 G 69
>>>>>>>>>> 2. mag 778 g69
>>>>>>>>>> 3. mag778g69
>>>>>>>>>>
>>>>>>>>>> But that doesn't work: I'm able to get the document if and only if
>>>>>>>>>> I use the "normalized" form: mag778g69
>>>>>>>>>>
>>>>>>>>>> After doing a little bit of debugging, I see that, even though I
>>>>>>>>>> used a KeywordTokenizer in my field type declaration, Solr is doing
>>>>>>>>>> something like
>>>>>>>>>> this:
>>>>>>>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>>>>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>>>>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>>>>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>>>>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> That is, it is tokenizing the original query string (mag + 778 +
>>>>>>>>>> g +
>>>>>>>>>> 69) and obviously querying the field for separate tokens doesn't
>>>>>>>>>> match
>>>>>>>>>> anything (at least this is what I think)
>>>>>>>>>>
>>>>>>>>>> Could anybody please explain that to me?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance
>>>>>>>>>> Andrea
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>

Re: Tokenization at query time

Posted by Andrea Gazzarini <an...@gmail.com>.
Hi Erick,
sorry, I forgot the Solr version... it's 3.6.0

ClientUtils in that version does whitespace escaping:

   public static String escapeQueryChars(String s) {
     StringBuilder sb = new StringBuilder();
     for (int i = 0; i < s.length(); i++) {
       char c = s.charAt(i);
       // These characters are part of the query syntax and must be escaped
       if (c == '\\' || c == '+' || c == '-' || c == '!'  || c == '(' || 
c == ')' || c == ':'
         || c == '^' || c == '[' || c == ']' || c == '\"' || c == '{' || 
c == '}' || c == '~'
         || c == '*' || c == '?' || c == '|' || c == '&'  || c == ';'
         || Character.isWhitespace(c)) {
         sb.append('\\');
       }
       sb.append(c);
     }
     return sb.toString();
   }

Now, I solved the issue, but I'm not really sure about the solution.

Debugging the code I saw that the query string (in the SearchHandler)

978\ 90\ 04\ 23560\ 1

once passed through the dismax query parser (specifically through SolrPluginUtils.partialEscape(CharSequence))

becomes

978\\ 90\\ 04\\ 23560\\ 1

because that method escapes the backslashes.

So, using the Eclipse debugger, I removed the additional backslashes at runtime and it works perfectly, but of course... I can't do that in production for every search :D

So, just to try, I changed dismax to edismax, which, I saw, doesn't call SolrPluginUtils... and it works perfectly!

I saw in your query string that you used edismax too... maybe that is the point?
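Just to illustrate what I saw in the debugger, here is a minimal standalone sketch. The whitespace escaping mirrors the branch of ClientUtils.escapeQueryChars quoted above; the partialEscape step is simplified to "escape every backslash again", which is my assumption of the relevant behaviour, not the actual Solr 3.6 source:

```java
public class DoubleEscapeSketch {

    // Whitespace escaping as done client-side (same branch as the
    // ClientUtils.escapeQueryChars code quoted above).
    public static String escapeWhitespace(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Simplified stand-in (assumption) for what the dismax parser's
    // partialEscape does to an already-escaped string: the existing
    // backslashes get escaped once more.
    public static String reEscapeBackslashes(String s) {
        return s.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        String clientEscaped = escapeWhitespace("978 90 04 23560 1");
        System.out.println(clientEscaped); // prints: 978\ 90\ 04\ 23560\ 1

        // After the extra escaping the parser no longer sees escaped spaces,
        // so the query splits into 978, 90, 04, 23560, 1 again.
        System.out.println(reEscapeBackslashes(clientEscaped));
    }
}
```

This is why the escaped query reaches the field analyzer as five separate tokens instead of one.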

Many thanks
Andrea

On 08/26/2013 02:47 PM, Erick Erickson wrote:
> Andrea:
>
> Works for me, admittedly through the browser....
>
> I suspect the problem is here: ClientUtils.escapeQueryChars
>
> That doesn't do anything about escaping the spaces, it just handles
> characters that have special meaning to the query syntax, things like +-
> etc.
>
> Using your field definition, this:
> http://localhost:8983/solr/select?wt=json&q=ab\ cd\
> ef&debug=query&defType=edismax&qf=name eoe
> produced this output..
>
>     - parsedquery_toString: "+(eoe:abcdef | (name:ab name:cd name:ef))",
>
>
> where the field "eoe" is your isbn_issn type.
>
> Best,
> Erick
>
>
> On Mon, Aug 26, 2013 at 4:55 AM, Andrea Gazzarini <
> andrea.gazzarini@gmail.com> wrote:
>
>> Hi Erick,
>> escaping spaces doesn't work...
>>
>> Briefly,
>>
>> - In a document I have an ISBN field that (stored value) is
>> *978-90-04-23560-1*
>> - In the index I have this value: *9789004235601*
>>
>> Now, I want be able to search the document by using:
>>
>> 1) q=*978-90-04-23560-1*
>> 2) q=*978 90 04 23560 1*
>> 3) q=*9789004235601*
>>
>> 1 and 3 works perfectly, 2 doesn't work.
>>
>> My code is:
>>
>> SolrQuery query = new SolrQuery(ClientUtils.escapeQueryChars(req.getParameter("q")));
>>
>> isbn is declared in this way
>>
>> <fieldtype name="isbn_issn" class="solr.TextField"
>> positionIncrementGap="100">
>>      <analyzer>
>>          <tokenizer class="solr.KeywordTokenizerFactory"/>
>>          <filter class="solr.LowerCaseFilterFactory"/>
>>          <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="0" generateNumberParts="0" catenateWords="0"
>> catenateNumbers="0" catenateAll="1" splitOnCaseChange="0"/>
>>      </analyzer>
>> </fieldtype>
>> <field name="isbn_issn_search" type="isbn_issn" indexed="true"/>
>> search handler is:
>>
>>      <requestHandler name="any_bc" class="solr.SearchHandler"
>> default="true">
>>          <lst name="defaults">
>>              <str name="defType">*dismax*</str>
>>
>>              <str name="mm">100%</str>
>>              <str name="qf">
>> *isbn_issn_search*^10000
>>              </str>
>>              <str name="pf">
>> *isbn_issn_search*^100000
>>              </str>
>>              <int name="ps">0</int>
>>              <float name="tie">0.1</float>
>>              ...
>>      </requestHandler>
>>
>> This is what I get:
>>
>> 1) 978-90-04-23560-1
>> path=/select params={start=0&q=978\-90\-04\-23560\-1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=1 status=0 QTime=5
>>
>> 2) 9789004235601
>> webapp=/solr path=/select params={start=0&q=9789004235601&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=1 status=0 QTime=5
>>
>> 3) 978 90 04 23560 1
>> path=/select params={start=0&q=978\+90\+04\+23560\+1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=0 status=0 QTime=2
>>
>> Extract from debugQuery=true:
>>
>> <str name="q">978\ 90\ 04\ 23560\ 1</str>
>> ...
>> <str name="rawquerystring">978\ 90\ 04\ 23560\ 1</str>
>> <str name="querystring">978\ 90\ 04\ 23560\ 1</str>
>> ...
>> <str name="parsedquery">
>>      +((DisjunctionMaxQuery((isbn_issn_search:978^10000.0)~0.1)
>>      DisjunctionMaxQuery((isbn_issn_search:90^10000.0)~0.1)
>>      DisjunctionMaxQuery((isbn_issn_search:04^10000.0)~0.1)
>>      DisjunctionMaxQuery((isbn_issn_search:23560^10000.0)~0.1)
>>      DisjunctionMaxQuery((isbn_issn_search:1^10000.0)~0.1))~5)
>>      DisjunctionMaxQuery((isbn_issn_search:9789004235601^100000.0)~0.1)
>> </str>
>>
>> --------------------------------------------------
>> Probably this is a very stupid question but I'm going crazy. In this page
>>
>> http://wiki.apache.org/solr/DisMaxQParserPlugin
>>
>> *Query Structure*
>>
>> "For each 'word' in the query string, dismax builds a DisjunctionMaxQuery
>> object for that word across all of the fields in the qf param..."
>>
>> And that seems to be exactly what it is doing... but what is a "word"? How
>> can I force spaces (without using double quotes) to be treated as part of
>> the word?
>>
>> Many thanks
>> Andrea
>>
>>
>> On 08/13/2013 04:18 PM, Erick Erickson wrote:
>>
>>> I think you can get what you want by escaping the space with a
>>> backslash....
>>>
>>> YMMV of course.
>>> Erick
>>>
>>>
>>> On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <
>>> andrea.gazzarini@gmail.com> wrote:
>>>
>>>   Hi Erick,
>>>> sorry if that wasn't clear: this is what I'm actually observing in my
>>>> application.
>>>>
>>>> I wrote the first post after looking at the explain (debugQuery=true):
>>>> the
>>>> query
>>>>
>>>> q=mag 778 G 69
>>>>
>>>> is translated as follow:
>>>>
>>>>
>>>>   +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>         DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>         DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>         DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>         DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>
>>>> It seems that although I declare myfield with this type
>>>>
>>>> <fieldtype name="type1" class="solr.TextField">
>>>>
>>>>       <analyzer>
>>>>           <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>               generateWordParts="0" generateNumberParts="0"
>>>>               catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>>               splitOnCaseChange="0" />
>>>>       </analyzer>
>>>> </fieldtype>
>>>>
>>>> Solr is therefore tokenizing it, producing several tokens
>>>> (mag, 778, g, 69).
>>>>
>>>> And I can't put double quotes on the query (q="mag 778 G 69") because the
>>>> request handler searches also in other fields (with different
>>>> configuration
>>>> chains)
>>>>
>>>> As I understand it, the query parser does a whitespace
>>>> tokenization on its own before invoking my query-time chain. The same
>>>> doesn't happen at index time... this is my problem, because at index time
>>>> the field is analyzed exactly as I want, but unfortunately I cannot say
>>>> the same at query time.
>>>>
>>>> Sorry for my English, did you get the point?
>>>>
>>>>
>>>> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>>>>
>>>>   On a quick scan I don't see a problem here. Attach
>>>>> &debug=query to your url and that'll show you the
>>>>> parsed query, which will in turn show you what's been
>>>>> pushed through the analysis chain you've defined.
>>>>>
>>>>> You haven't stated whether you've tried this and it's
>>>>> not working or you're looking for guidance as to how
>>>>> to accomplish this so it's a little unclear how to
>>>>> respond.
>>>>>
>>>>> BTW, the admin/analysis page is your friend here....
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>>
>>>>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>
>>>>>    Clear, thanks for response.
>>>>>
>>>>>> So, if I have two fields
>>>>>>
>>>>>> <fieldtype name="type1" class="solr.TextField">
>>>>>>        <analyzer>
>>>>>>            <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>            <filter class="solr.LowerCaseFilterFactory" />
>>>>>>            <filter class="solr.WordDelimiterFilterFactory"
>>>>>>                generateWordParts="0" generateNumberParts="0"
>>>>>>                catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>>>>                splitOnCaseChange="0" />
>>>>>>        </analyzer>
>>>>>> </fieldtype>
>>>>>> <fieldtype name="type2" class="solr.TextField">
>>>>>>        <analyzer>
>>>>>>            <charFilter class="solr.MappingCharFilterFactory"
>>>>>>                mapping="mapping-FoldToASCII.txt"/>
>>>>>>            <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>>>>            <filter class="solr.LowerCaseFilterFactory" />
>>>>>>            <filter class="solr.WordDelimiterFilterFactory" .../>
>>>>>>        </analyzer>
>>>>>> </fieldtype>
>>>>>>
>>>>>> (with the first field type, *Mag. 78 D 99* becomes *mag78d99*, while
>>>>>> the second field type ends up with several tokens)
>>>>>>
>>>>>> And I want to use the same request handler to query against both of
>>>>>> them.
>>>>>> I mean, I want the user to search for something like
>>>>>>
>>>>>> http://..../search?q=Mag 78 D 99
>>>>>>
>>>>>> and this search should search within both the first (with type1) and
>>>>>> second (with type 2) by matching
>>>>>>
>>>>>> - a document which has field_with_type1 equals to *mag78d99* or
>>>>>> - a document which has field_with_type2 that contains a text like "go
>>>>>> to *mag 78*, class *d* and subclass *99*")
>>>>>>
>>>>>>
>>>>>> <requestHandler ....>
>>>>>>        ...
>>>>>>        <str name="defType">dismax</str>
>>>>>>        ...
>>>>>>        <str name="mm">100%</str>
>>>>>>        <str name="qf">
>>>>>>            field_with_type1
>>>>>>            field_with_type_2
>>>>>>        </str>
>>>>>>        ...
>>>>>> </requestHandler>
>>>>>>
>>>>>> Is this not possible? If so, is it possible to do it in some other way?
>>>>>>
>>>>>> Sorry for the long email and thanks again
>>>>>> Andrea
>>>>>>
>>>>>>
>>>>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>>>>
>>>>>>> Quoted phrases will be passed to the analyzer as one string, so there
>>>>>>> a whitespace tokenizer is needed.
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> -----Original Message----- From: Andrea Gazzarini
>>>>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Subject: Re: Tokenization at query time
>>>>>>>
>>>>>>> Hi Tanguy,
>>>>>>> thanks for fast response. What you are saying corresponds perfectly
>>>>>>> with
>>>>>>> the behaviour I'm observing.
>>>>>>> Now, other than having a big problem (I have several other fields both
>>>>>>> in the pf and qf where spaces don't matter, field types like the
>>>>>>> "text_en" field type in the example schema) what I'm wondering is:
>>>>>>>
>>>>>>> "The query parser splits the input query on white spaces, and then
>>>>>>> each token is analysed according to your configuration"
>>>>>>> Is there a valid reason to declare a WhiteSpaceTokenizer in a query
>>>>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>>>>> tokenized) what is its effect?
>>>>>>>
>>>>>>> Thank you very much for the help
>>>>>>> Andrea
>>>>>>>
>>>>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>>>>
>>>>>>>    Hello Andrea,
>>>>>>>
>>>>>>>> I think you face a rather common issue involving keyword tokenization
>>>>>>>> and query parsing in Lucene:
>>>>>>>> The query parser splits the input query on white spaces, and then
>>>>>>>> each
>>>>>>>> token is analysed according to your configuration.
>>>>>>>> So those queries with a whitespace won't behave as expected because
>>>>>>>> each
>>>>>>>> token is analysed separately. Consequently, the catenated version of
>>>>>>>> the
>>>>>>>> reference cannot be generated.
>>>>>>>> I think you could try surrounding your query with double quotes or
>>>>>>>> escaping the space characters in your query using a backslash so that
>>>>>>>> the
>>>>>>>> whole sequence is analysed in the same analyser and the catenation
>>>>>>>> occurs.
>>>>>>>> You should be aware that this approach has a drawback: you will
>>>>>>>> probably
>>>>>>>> not be able to combine the search for Mag. 778 G 69 with other words
>>>>>>>> in
>>>>>>>> other fields unless you are able to identify which spaces are to be
>>>>>>>> escaped:
>>>>>>>> For example, if the input query is:
>>>>>>>> Awesome Mag. 778 G 69
>>>>>>>> you would want to transform it to:
>>>>>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>>>>>>> or
>>>>>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase
>>>>>>>> query
>>>>>>>>
>>>>>>>> Do you get the point?
>>>>>>>>
>>>>>>>> Look at the differences between what you tried and the following
>>>>>>>> examples which should all do what you want:
>>>>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>>>>> OR
>>>>>>>>
>>>>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>>>>> OR
>>>>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>>>>
>>>>>>>> 20myfield&defType=**edismax
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I hope this helps
>>>>>>>>
>>>>>>>> Tanguy
>>>>>>>>
>>>>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>>>>
>>>>>>>>     Hi all,
>>>>>>>>
>>>>>>>>   I have a field (among others)in my schema defined like this:
>>>>>>>>> <fieldtype name="mytype" class="solr.TextField"
>>>>>>>>> positionIncrementGap="100">
>>>>>>>>>        <analyzer>
>>>>>>>>>            <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>>>>            <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>>>            <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>>                generateWordParts="0"
>>>>>>>>>                generateNumberParts="0"
>>>>>>>>>                catenateWords="0"
>>>>>>>>>                catenateNumbers="0"
>>>>>>>>>                catenateAll="1"
>>>>>>>>>                splitOnCaseChange="0" />
>>>>>>>>>        </analyzer>
>>>>>>>>> </fieldtype>
>>>>>>>>>
>>>>>>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>>>>>>
>>>>>>>>> Basically, both at index and query time the field value is
>>>>>>>>> normalized
>>>>>>>>> like this.
>>>>>>>>>
>>>>>>>>> Mag. 778 G 69 => mag778g69
>>>>>>>>>
>>>>>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>>>>>
>>>>>>>>> <requestHandler ....>
>>>>>>>>>        ...
>>>>>>>>>        <str name="defType">dismax</str>
>>>>>>>>>        ...
>>>>>>>>>        <str name="mm">100%</str>
>>>>>>>>>        <str name="qf">myfield^3000</str>
>>>>>>>>>        <str name="pf">myfield^30000</str>
>>>>>>>>>
>>>>>>>>> </requestHandler>
>>>>>>>>>
>>>>>>>>> What I'm expecting is that if I index a document with a value for my
>>>>>>>>> field "Mag. 778 G 69", I will be able to get this document by
>>>>>>>>> querying
>>>>>>>>>
>>>>>>>>> 1. Mag. 778 G 69
>>>>>>>>> 2. mag 778 g69
>>>>>>>>> 3. mag778g69
>>>>>>>>>
>>>>>>>>> But that doesn't work: I'm able to get the document if and only if
>>>>>>>>> I use the "normalized" form: mag778g69
>>>>>>>>>
>>>>>>>>> After doing a little bit of debugging, I see that, even though I
>>>>>>>>> used a KeywordTokenizer in my field type declaration, Solr is doing
>>>>>>>>> something like
>>>>>>>>> this:
>>>>>>>>> /
>>>>>>>>> // +((DisjunctionMaxQuery((//******myfield://*mag*//^3000.0)~0.1)
>>>>>>>>> DisjunctionMaxQuery((//******myfield://*778*//^3000.0)~0.1)
>>>>>>>>> DisjunctionMaxQuery((//******myfield://*g*//^3000.0)~0.1)
>>>>>>>>> DisjunctionMaxQuery((//******myfield://*69*//^3000.0)~0.1))**
>>>>>>>>> ****~4)
>>>>>>>>> DisjunctionMaxQuery((//******myfield://*mag778g69*//^30000.**
>>>>>>>>> ****0)~0.1)/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>>>>>>> 69) and obviously querying the field for separate tokens doesn't
>>>>>>>>> match
>>>>>>>>> anything (at least this is what I think)
>>>>>>>>>
>>>>>>>>> Does anybody could please explain me that?
>>>>>>>>>
>>>>>>>>> Thanks in advance
>>>>>>>>> Andrea
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>


Re: Tokenization at query time

Posted by Erick Erickson <er...@gmail.com>.
Andrea:

Works for me, admittedly through the browser....

I suspect the problem is here: ClientUtils.escapeQueryChars

That doesn't do anything about escaping the spaces, it just handles
characters that have special meaning to the query syntax, things like +-
etc.

Using your field definition, this:
http://localhost:8983/solr/select?wt=json&q=ab\ cd\
ef&debug=query&defType=edismax&qf=name eoe
produced this output..

   - parsedquery_toString: "+(eoe:abcdef | (name:ab name:cd name:ef))",


where the field "eoe" is your isbn_issn type.

Best,
Erick
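The escaping shown in the URL above can be applied client-side before the query string is sent. A minimal sketch in plain Java (the helper class and method names are made up, not part of SolrJ):

```java
// Hypothetical helper: prefix every whitespace character with a backslash
// so the dismax/edismax parser does not split the reference into words,
// and the whole string reaches the KeywordTokenizer as a single token.
public class SpaceEscaper {

    public static String escapeSpaces(String q) {
        StringBuilder sb = new StringBuilder(q.length() + 8);
        for (char c : q.toCharArray()) {
            if (Character.isWhitespace(c)) {
                sb.append('\\'); // escape the whitespace character
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Mag. 778 G 69" -> "Mag.\ 778\ G\ 69"
        System.out.println(escapeSpaces("Mag. 778 G 69"));
    }
}
```

Note that later messages in this thread show this is not always sufficient with plain dismax, so treat it as a starting point rather than a guaranteed fix.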


On Mon, Aug 26, 2013 at 4:55 AM, Andrea Gazzarini <
andrea.gazzarini@gmail.com> wrote:

> Hi Erick,
> escaping spaces doesn't work...
>
> Briefly,
>
> - In a document I have an ISBN field that (stored value) is
> *978-90-04-23560-1*
> - In the index I have this value: *9789004235601*
>
> Now, I want to be able to search the document by using:
>
> 1) q=*978-90-04-23560-1*
> 2) q=*978 90 04 23560 1*
> 3) q=*9789004235601*
>
> 1 and 3 work perfectly; 2 doesn't.
>
> My code is:
>
> SolrQuery query = new SolrQuery(ClientUtils.escapeQueryChars(req.getParameter("q")));
>
> isbn is declared in this way
>
> <fieldtype name="isbn_issn" class="solr.TextField"
> positionIncrementGap="100">
>     <analyzer>
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="0" catenateWords="0"
> catenateNumbers="0" catenateAll="1" splitOnCaseChange="0"/>
>     </analyzer>
> </fieldtype>
> <field name="isbn_issn_search" type="isbn_issn" indexed="true"/>
> search handler is:
>
>     <requestHandler name="any_bc" class="solr.SearchHandler"
> default="true">
>         <lst name="defaults">
>             <str name="defType">dismax</str>
>             <str name="mm">100%</str>
>             <str name="qf">
>                 isbn_issn_search^10000
>             </str>
>             <str name="pf">
>                 isbn_issn_search^100000
>             </str>
>             <int name="ps">0</int>
>             <float name="tie">0.1</float>
>             ...
>     </requestHandler>
>
> This is what I get:
>
> 1) 978-90-04-23560-1
> path=/select params={start=0&q=978\-90\-04\-23560\-1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=1 status=0 QTime=5
>
> 2) 9789004235601
> webapp=/solr path=/select params={start=0&q=9789004235601&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=1 status=0 QTime=5
>
> 3) 978 90 04 23560 1
> path=/select params={start=0&q=978\+90\+04\+23560\+1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=0 status=0 QTime=2
>
> Extract from debugQuery=true:
>
> <str name="q">978\ 90\ 04\ 23560\ 1</str>
> ...
> <str name="rawquerystring">978\ 90\ 04\ 23560\ 1</str>
> <str name="querystring">978\ 90\ 04\ 23560\ 1</str>
> ...
> <str name="parsedquery">
>     +((DisjunctionMaxQuery((isbn_issn_search:978^10000.0)~0.1)
>     DisjunctionMaxQuery((isbn_issn_search:90^10000.0)~0.1)
>     DisjunctionMaxQuery((isbn_issn_search:04^10000.0)~0.1)
>     DisjunctionMaxQuery((isbn_issn_search:23560^10000.0)~0.1)
>     DisjunctionMaxQuery((isbn_issn_search:1^10000.0)~0.1))~5)
>     DisjunctionMaxQuery((isbn_issn_search:9789004235601^100000.0)~0.1)
> </str>
>
> ------------------------------------------------
> Probably this is a very stupid question but I'm going crazy. In this page
>
> http://wiki.apache.org/solr/DisMaxQParserPlugin
>
> *Query Structure*
>
> For each "word" in the query string, dismax builds a DisjunctionMaxQuery
> object for that word across all of the fields in the qf param...
>
> And that seems to be exactly what it is doing... but what is a "word"? How can I
> make spaces count as part of the word (without using double quotes)?
>
> Many many many thanks
> Andrea
>
>
> On 08/13/2013 04:18 PM, Erick Erickson wrote:
>
>> I think you can get what you want by escaping the space with a
>> backslash....
>>
>> YMMV of course.
>> Erick
>>
>>
>> On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <
>> andrea.gazzarini@gmail.com> wrote:
>>
>>  Hi Erick,
>>> sorry if that wasn't clear: this is what I'm actually observing in my
>>> application.
>>>
>>> I wrote the first post after looking at the explain (debugQuery=true):
>>> the
>>> query
>>>
>>> q=mag 778 G 69
>>>
>>> is translated as follow:
>>>
>>>
>>>   +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>       DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>       DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>       DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>       DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>
>>> It seems that although I declare myfield with this type
>>>
>>> <fieldtype name="type1" class="solr.TextField" >
>>>      <analyzer>
>>>          <tokenizer class="solr.KeywordTokenizerFactory" />
>>>          <filter class="solr.LowerCaseFilterFactory" />
>>>          <filter class="solr.WordDelimiterFilterFactory"
>>> generateWordParts="0" generateNumberParts="0"
>>>              catenateWords="0" catenateNumbers="0" catenateAll="1"
>>> splitOnCaseChange="0" />
>>>      </analyzer>
>>> </fieldtype>
>>>
>>> Solr is tokenizing it, therefore producing several tokens
>>> (mag, 778, g, 69)
>>>
>>> And I can't put double quotes on the query (q="mag 778 G 69") because the
>>> request handler searches also in other fields (with different
>>> configuration
>>> chains)
>>>
>>> As I understand it, the query parser (i.e. at query time) does a whitespace
>>> tokenization of its own before invoking my (query-time) chain. The same
>>> doesn't happen at index time... this is my problem, because at index time
>>> the field is analyzed exactly as I want, but unfortunately I cannot say the
>>> same at query time.
>>>
>>> Sorry for my wonderful English, did you get the point?
>>>
>>>
>>> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>>>
>>>  On a quick scan I don't see a problem here. Attach
>>>> &debug=query to your url and that'll show you the
>>>> parsed query, which will in turn show you what's been
>>>> pushed through the analysis chain you've defined.
>>>>
>>>> You haven't stated whether you've tried this and it's
>>>> not working or you're looking for guidance as to how
>>>> to accomplish this so it's a little unclear how to
>>>> respond.
>>>>
>>>> BTW, the admin/analysis page is your friend here....
>>>>
>>>> Best
>>>> Erick
>>>>
>>>>
>>>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
>>>> andrea.gazzarini@gmail.com> wrote:
>>>>
>>>>   Clear, thanks for response.
>>>>
>>>>> So, if I have two fields
>>>>>
>>>>> <fieldtype name="type1" class="solr.TextField" >
>>>>>       <analyzer>
>>>>>           <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>> generateWordParts="0" generateNumberParts="0"
>>>>>               catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>>> splitOnCaseChange="0" />
>>>>>       </analyzer>
>>>>> </fieldtype>
>>>>> <fieldtype name="type2" class="solr.TextField" >
>>>>>       <analyzer>
>>>>>           <charFilter class="solr.MappingCharFilterFactory"
>>>>> mapping="mapping-FoldToASCII.txt"/>
>>>>>           <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>>           <filter class="solr.WordDelimiterFilterFactory" .../>
>>>>>       </analyzer>
>>>>> </fieldtype>
>>>>>
>>>>> (first field type *Mag. 78 D 99* becomes *mag78d99* while second field
>>>>> type ends with several tokens)
>>>>>
>>>>> And I want to use the same request handler to query against both of
>>>>> them.
>>>>> I mean I want the user to search for something like
>>>>>
>>>>> http://..../search?q=Mag 78 D 99
>>>>>
>>>>> and this search should search within both the first (with type1) and
>>>>> second (with type 2) by matching
>>>>>
>>>>> - a document which has field_with_type1 equals to *mag78d99* or
>>>>> - a document which has field_with_type2 that contains text like "go to
>>>>> *mag 78*, class *d* and subclass *99*")
>>>>>
>>>>>
>>>>> <requestHandler ....>
>>>>>       ...
>>>>>       <str name="defType">dismax</str>
>>>>>       ...
>>>>>       <str name="mm">100%</str>
>>>>>       <str name="qf">
>>>>>           field_with_type1
>>>>>           field_with_type_2
>>>>>       </str>
>>>>>       ...
>>>>> </requestHandler>
>>>>>
>>>>> Is this not possible? If so, is it possible to do that in some other way?
>>>>>
>>>>> Sorry for the long email and thanks again
>>>>> Andrea
>>>>>
>>>>>
>>>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>>>
>>>>>> Quoted phrases will be passed to the analyzer as one string, so a
>>>>>> whitespace tokenizer is needed there.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> -----Original Message----- From: Andrea Gazzarini
>>>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: Re: Tokenization at query time
>>>>>>
>>>>>> Hi Tanguy,
>>>>>> thanks for the fast response. What you are saying corresponds perfectly
>>>>>> with the behaviour I'm observing.
>>>>>> Now, other than having a big problem (I have several other fields, both
>>>>>> in the pf and qf, where spaces don't matter; field types like the
>>>>>> "text_en" field type in the example schema), what I'm wondering is:
>>>>>>
>>>>>> "The query parser splits the input query on white spaces, and then each
>>>>>> token is analysed according to your configuration"
>>>>>> Is there a valid reason to declare a WhiteSpaceTokenizer in a query
>>>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>>>> tokenized) what is its effect?
>>>>>>
>>>>>> Thank you very much for the help
>>>>>> Andrea
>>>>>>
>>>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>>>
>>>>>>   Hello Andrea,
>>>>>>
>>>>>>> I think you face a rather common issue involving keyword tokenization
>>>>>>> and query parsing in Lucene:
>>>>>>> The query parser splits the input query on white spaces, and then
>>>>>>> each
>>>>>>> token is analysed according to your configuration.
>>>>>>> So those queries with a whitespace won't behave as expected because
>>>>>>> each
>>>>>>> token is analysed separately. Consequently, the catenated version of
>>>>>>> the
>>>>>>> reference cannot be generated.
>>>>>>> I think you could try surrounding your query with double quotes or
>>>>>>> escaping the space characters in your query using a backslash so that
>>>>>>> the
>>>>>>> whole sequence is analysed in the same analyser and the catenation
>>>>>>> occurs.
>>>>>>> You should be aware that this approach has a drawback: you will
>>>>>>> probably
>>>>>>> not be able to combine the search for Mag. 778 G 69 with other words
>>>>>>> in
>>>>>>> other fields unless you are able to identify which spaces are to be
>>>>>>> escaped:
>>>>>>> For example, if the input query is:
>>>>>>> Awesome Mag. 778 G 69
>>>>>>> you would want to transform it to:
>>>>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>>>>>> or
>>>>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase
>>>>>>> query
>>>>>>>
>>>>>>> Do you get the point?
>>>>>>>
>>>>>>> Look at the differences between what you tried and the following
>>>>>>> examples which should all do what you want:
>>>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>>>> OR
>>>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>>>> OR
>>>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I hope this helps
>>>>>>>
>>>>>>> Tanguy
>>>>>>>
>>>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>>>
>>>>>>>    Hi all,
>>>>>>>
>>>>>>>  I have a field (among others)in my schema defined like this:
>>>>>>>>
>>>>>>>> <fieldtype name="mytype" class="solr.TextField"
>>>>>>>> positionIncrementGap="100">
>>>>>>>>       <analyzer>
>>>>>>>>           <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>               generateWordParts="0"
>>>>>>>>               generateNumberParts="0"
>>>>>>>>               catenateWords="0"
>>>>>>>>               catenateNumbers="0"
>>>>>>>>               catenateAll="1"
>>>>>>>>               splitOnCaseChange="0" />
>>>>>>>>       </analyzer>
>>>>>>>> </fieldtype>
>>>>>>>>
>>>>>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>>>>>
>>>>>>>> Basically, both at index and query time the field value is
>>>>>>>> normalized
>>>>>>>> like this.
>>>>>>>>
>>>>>>>> Mag. 778 G 69 => mag778g69
>>>>>>>>
>>>>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>>>>
>>>>>>>> <requestHandler ....>
>>>>>>>>       ...
>>>>>>>>       <str name="defType">dismax</str>
>>>>>>>>       ...
>>>>>>>>       <str name="mm">100%</str>
>>>>>>>>       <str name="qf">myfield^3000</str>
>>>>>>>>       <str name="pf">myfield^30000</str>
>>>>>>>>
>>>>>>>> </requestHandler>
>>>>>>>>
>>>>>>>> What I'm expecting is that if I index a document with a value for my
>>>>>>>> field "Mag. 778 G 69", I will be able to get this document by
>>>>>>>> querying
>>>>>>>>
>>>>>>>> 1. Mag. 778 G 69
>>>>>>>> 2. mag 778 g69
>>>>>>>> 3. mag778g69
>>>>>>>>
>>>>>>>> But that doesn't work: I'm able to get the document only if I
>>>>>>>> use the "normalized" form: mag778g69
>>>>>>>>
>>>>>>>> After doing a little bit of debugging, I see that, even though I used a
>>>>>>>> KeywordTokenizer in my field type declaration, Solr is doing something
>>>>>>>> like this:
>>>>>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>>>>
>>>>>>>>
>>>>>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>>>>>> 69) and obviously querying the field for separate tokens doesn't
>>>>>>>> match
>>>>>>>> anything (at least this is what I think)
>>>>>>>>
>>>>>>>> Could anybody please explain this to me?
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>> Andrea
>>>>>>>>
>>>>>>>>
>>>>>>>>
>

Re: Tokenization at query time

Posted by Andrea Gazzarini <an...@gmail.com>.
Hi Erick,
escaping spaces doesn't work...

Briefly,

- In a document I have an ISBN field that (stored value) is 
*978-90-04-23560-1*
- In the index I have this value: *9789004235601*

Now, I want to be able to search the document by using:

1) q=*978-90-04-23560-1*
2) q=*978 90 04 23560 1*
3) q=*9789004235601*

1 and 3 work perfectly; 2 doesn't.

My code is:

SolrQuery query = new
SolrQuery(ClientUtils.escapeQueryChars(req.getParameter("q")));

isbn is declared in this way

<fieldtype name="isbn_issn" class="solr.TextField" 
positionIncrementGap="100">
     <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.WordDelimiterFilterFactory" 
generateWordParts="0" generateNumberParts="0" catenateWords="0" 
catenateNumbers="0" catenateAll="1" splitOnCaseChange="0"/>
     </analyzer>
</fieldtype>
<field name="isbn_issn_search" type="isbn_issn" indexed="true"/>
search handler is:

     <requestHandler name="any_bc" class="solr.SearchHandler" 
default="true">
         <lst name="defaults">
             <str name="defType">dismax</str>
             <str name="mm">100%</str>
             <str name="qf">
                 isbn_issn_search^10000
             </str>
             <str name="pf">
                 isbn_issn_search^100000
             </str>
             <int name="ps">0</int>
             <float name="tie">0.1</float>
             ...
     </requestHandler>

This is what I get:

1) 978-90-04-23560-1
path=/select
params={start=0&q=978\-90\-04\-23560\-1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2}
hits=1 status=0 QTime=5

2) 9789004235601
webapp=/solr path=/select
params={start=0&q=9789004235601&sfield=&qt=any_bc&wt=javabin&rows=10&version=2}
hits=1 status=0 QTime=5

3) 978 90 04 23560 1
path=/select
params={start=0&q=978\+90\+04\+23560\+1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2}
hits=0 status=0 QTime=2

Extract from debugQuery=true:

<str name="q">978\ 90\ 04\ 23560\ 1</str>
...
<str name="rawquerystring">978\ 90\ 04\ 23560\ 1</str>
<str name="querystring">978\ 90\ 04\ 23560\ 1</str>
...
<str name="parsedquery">
     +((DisjunctionMaxQuery((isbn_issn_search:978^10000.0)~0.1)
     DisjunctionMaxQuery((isbn_issn_search:90^10000.0)~0.1)
     DisjunctionMaxQuery((isbn_issn_search:04^10000.0)~0.1)
     DisjunctionMaxQuery((isbn_issn_search:23560^10000.0)~0.1)
     DisjunctionMaxQuery((isbn_issn_search:1^10000.0)~0.1))~5)
     DisjunctionMaxQuery((isbn_issn_search:9789004235601^100000.0)~0.1)
</str>
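To see why only queries 1 and 3 match, it helps to model what the index-time chain stores. Roughly (this is plain Java approximating KeywordTokenizer + LowerCaseFilter + WordDelimiterFilter with catenateAll="1"; it is an editor's sketch, not Solr code):

```java
// Rough approximation of the isbn_issn analysis chain: the whole input is
// one token (KeywordTokenizer), lowercased, and all delimiters are dropped
// so only the catenated form remains in the index.
public class IsbnNormalizer {

    public static String normalize(String raw) {
        return raw.toLowerCase().replaceAll("[^a-z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("978-90-04-23560-1")); // 9789004235601
        System.out.println(normalize("Mag. 778 G 69"));     // mag778g69
    }
}
```

The index therefore holds the single term 9789004235601, so the per-word queries (978, 90, 04, ...) that dismax produces for query 2 can never match it.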

------------------------------------------------
Probably this is a very stupid question but I'm going crazy. In this page

http://wiki.apache.org/solr/DisMaxQParserPlugin

*Query Structure*

For each "word" in the query string, dismax builds a
DisjunctionMaxQuery object for that word across all of the fields in the
qf param...

And that seems to be exactly what it is doing... but what is a "word"? How can I
make spaces count as part of the word (without using double quotes)?

Many many many thanks
Andrea
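One workaround consistent with Jack's earlier remark (a quoted phrase reaches the analyzer as one string) is to send the suspected reference as a phrase. A minimal, hypothetical client-side helper (the class and method names are made up):

```java
// Hypothetical workaround: quote the whole user input so dismax hands it
// to the field's analyzer as a single phrase instead of word by word.
public class PhraseWrapper {

    public static String asPhrase(String q) {
        // Drop any embedded double quotes, then quote the whole input.
        return "\"" + q.replace("\"", "") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(asPhrase("978 90 04 23560 1")); // "978 90 04 23560 1"
    }
}
```

As noted earlier in the thread, this has the drawback that other qf fields can then only match the exact phrase, not its individual words.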

On 08/13/2013 04:18 PM, Erick Erickson wrote:
> I think you can get what you want by escaping the space with a backslash....
>
> YMMV of course.
> Erick
>
>
> On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <
> andrea.gazzarini@gmail.com> wrote:
>
>> Hi Erick,
>> sorry if that wasn't clear: this is what I'm actually observing in my
>> application.
>>
>> I wrote the first post after looking at the explain (debugQuery=true): the
>> query
>>
>> q=mag 778 G 69
>>
>> is translated as follow:
>>
>>
>>   +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>       DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>       DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>       DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>       DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>
>> It seems that although I declare myfield with this type
>>
>> <fieldtype name="type1" class="solr.TextField" >
>>      <analyzer>
>>          <tokenizer class="solr.KeywordTokenizerFactory" />
>>          <filter class="solr.LowerCaseFilterFactory" />
>>          <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="0" generateNumberParts="0"
>>              catenateWords="0" catenateNumbers="0" catenateAll="1"
>> splitOnCaseChange="0" />
>>      </analyzer>
>> </fieldtype>
>>
>> Solr is tokenizing it, therefore producing several tokens
>> (mag, 778, g, 69)
>>
>> And I can't put double quotes on the query (q="mag 778 G 69") because the
>> request handler searches also in other fields (with different configuration
>> chains)
>>
>> As I understand it, the query parser (i.e. at query time) does a whitespace
>> tokenization of its own before invoking my (query-time) chain. The same
>> doesn't happen at index time... this is my problem, because at index time
>> the field is analyzed exactly as I want, but unfortunately I cannot say the
>> same at query time.
>>
>> Sorry for my wonderful English, did you get the point?
>>
>>
>> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>>
>>> On a quick scan I don't see a problem here. Attach
>>> &debug=query to your url and that'll show you the
>>> parsed query, which will in turn show you what's been
>>> pushed through the analysis chain you've defined.
>>>
>>> You haven't stated whether you've tried this and it's
>>> not working or you're looking for guidance as to how
>>> to accomplish this so it's a little unclear how to
>>> respond.
>>>
>>> BTW, the admin/analysis page is your friend here....
>>>
>>> Best
>>> Erick
>>>
>>>
>>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
>>> andrea.gazzarini@gmail.com> wrote:
>>>
>>>   Clear, thanks for response.
>>>> So, if I have two fields
>>>>
>>>> <fieldtype name="type1" class="solr.TextField" >
>>>>       <analyzer>
>>>>           <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>> generateWordParts="0" generateNumberParts="0"
>>>>               catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>> splitOnCaseChange="0" />
>>>>       </analyzer>
>>>> </fieldtype>
>>>> <fieldtype name="type2" class="solr.TextField" >
>>>>       <analyzer>
>>>>           <charFilter class="solr.MappingCharFilterFactory"
>>>> mapping="mapping-FoldToASCII.txt"/>
>>>>           <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>           <filter class="solr.WordDelimiterFilterFactory" .../>
>>>>       </analyzer>
>>>> </fieldtype>
>>>>
>>>> (first field type *Mag. 78 D 99* becomes *mag78d99* while second field
>>>> type ends with several tokens)
>>>>
>>>> And I want to use the same request handler to query against both of them.
>>>> I mean I want the user to search for something like
>>>>
>>>> http://..../search?q=Mag 78 D 99
>>>>
>>>> and this search should search within both the first (with type1) and
>>>> second (with type 2) by matching
>>>>
>>>> - a document which has field_with_type1 equals to *mag78d99* or
>>>> - a document which has field_with_type2 that contains text like "go to
>>>> *mag 78*, class *d* and subclass *99*")
>>>>
>>>>
>>>> <requestHandler ....>
>>>>       ...
>>>>       <str name="defType">dismax</str>
>>>>       ...
>>>>       <str name="mm">100%</str>
>>>>       <str name="qf">
>>>>           field_with_type1
>>>>           field_with_type_2
>>>>       </str>
>>>>       ...
>>>> </requestHandler>
>>>>
>>>> Is this not possible? If so, is it possible to do that in some other way?
>>>>
>>>> Sorry for the long email and thanks again
>>>> Andrea
>>>>
>>>>
>>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>>
>>>>   Quoted phrases will be passed to the analyzer as one string, so there a
>>>>> white space tokenizer is needed.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> -----Original Message----- From: Andrea Gazzarini
>>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Tokenization at query time
>>>>>
>>>>> Hi Tanguy,
>>>>> thanks for the fast response. What you are saying corresponds perfectly with
>>>>> the behaviour I'm observing.
>>>>> Now, other than having a big problem (I have several other fields, both
>>>>> in the pf and qf, where spaces don't matter; field types like the
>>>>> "text_en" field type in the example schema), what I'm wondering is:
>>>>>
>>>>> "The query parser splits the input query on white spaces, and then each
>>>>> token is analysed according to your configuration"
>>>>> Is there a valid reason to declare a WhiteSpaceTokenizer in a query
>>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>>> tokenized) what is its effect?
>>>>>
>>>>> Thank you very much for the help
>>>>> Andrea
>>>>>
>>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>>
>>>>>   Hello Andrea,
>>>>>> I think you face a rather common issue involving keyword tokenization
>>>>>> and query parsing in Lucene:
>>>>>> The query parser splits the input query on white spaces, and then each
>>>>>> token is analysed according to your configuration.
>>>>>> So those queries with a whitespace won't behave as expected because
>>>>>> each
>>>>>> token is analysed separately. Consequently, the catenated version of
>>>>>> the
>>>>>> reference cannot be generated.
>>>>>> I think you could try surrounding your query with double quotes or
>>>>>> escaping the space characters in your query using a backslash so that
>>>>>> the
>>>>>> whole sequence is analysed in the same analyser and the catenation
>>>>>> occurs.
>>>>>> You should be aware that this approach has a drawback: you will
>>>>>> probably
>>>>>> not be able to combine the search for Mag. 778 G 69 with other words in
>>>>>> other fields unless you are able to identify which spaces are to be
>>>>>> escaped:
>>>>>> For example, if the input query is:
>>>>>> Awesome Mag. 778 G 69
>>>>>> you would want to transform it to:
>>>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>>>>> or
>>>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase
>>>>>> query
>>>>>>
>>>>>> Do you get the point?
>>>>>>
>>>>>> Look at the differences between what you tried and the following
>>>>>> examples which should all do what you want:
>>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>>> OR
>>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>>> OR
>>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>>
>>>>>>
>>>>>>
>>>>>> I hope this helps
>>>>>>
>>>>>> Tanguy
>>>>>>
>>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>>
>>>>>>    Hi all,
>>>>>>
>>>>>>> I have a field (among others) in my schema defined like this:
>>>>>>>
>>>>>>> <fieldtype name="mytype" class="solr.TextField"
>>>>>>> positionIncrementGap="100">
>>>>>>>       <analyzer>
>>>>>>>           <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>               generateWordParts="0"
>>>>>>>               generateNumberParts="0"
>>>>>>>               catenateWords="0"
>>>>>>>               catenateNumbers="0"
>>>>>>>               catenateAll="1"
>>>>>>>               splitOnCaseChange="0" />
>>>>>>>       </analyzer>
>>>>>>> </fieldtype>
>>>>>>>
>>>>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>>>>
>>>>>>> Basically, both at index and query time the field value is normalized
>>>>>>> like this.
>>>>>>>
>>>>>>> Mag. 778 G 69 => mag778g69
>>>>>>>
>>>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>>>
>>>>>>> <requestHandler ....>
>>>>>>>       ...
>>>>>>>       <str name="defType">dismax</str>
>>>>>>>       ...
>>>>>>>       <str name="mm">100%</str>
>>>>>>>       <str name="qf">myfield^3000</str>
>>>>>>>       <str name="pf">myfield^30000</str>
>>>>>>>
>>>>>>> </requestHandler>
>>>>>>>
>>>>>>> What I'm expecting is that if I index a document with a value for my
>>>>>>> field "Mag. 778 G 69", I will be able to get this document by querying
>>>>>>>
>>>>>>> 1. Mag. 778 G 69
>>>>>>> 2. mag 778 g69
>>>>>>> 3. mag778g69
>>>>>>>
>>>>>>> But that doesn't work: I'm able to get the document only if I
>>>>>>> use the "normalized" form: mag778g69
>>>>>>>
>>>>>>> After doing a little bit of debugging, I see that, even though I used a
>>>>>>> KeywordTokenizer in my field type declaration, Solr is doing something
>>>>>>> like this:
>>>>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>>>
>>>>>>>
>>>>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>>>>> 69) and obviously querying the field for separate tokens doesn't match
>>>>>>> anything (at least this is what I think)
>>>>>>>
>>>>>>> Could anybody please explain this to me?
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>> Andrea
>>>>>>>
>>>>>>>


Re: Tokenization at query time

Posted by Andrea Gazzarini <an...@gmail.com>.
Trying...thank you very much!

I'll let you know

Best,
Andrea

On 08/13/2013 04:18 PM, Erick Erickson wrote:
> I think you can get what you want by escaping the space with a backslash....
>
> YMMV of course.
> Erick
>
>
> On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <
> andrea.gazzarini@gmail.com> wrote:
>
>> Hi Erick,
>> sorry if that wasn't clear: this is what I'm actually observing in my
>> application.
>>
>> I wrote the first post after looking at the explain (debugQuery=true): the
>> query
>>
>> q=mag 778 G 69
>>
>> is translated as follows:
>>
>>
>>   +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>        DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>        DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>        DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>        DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>
>> It seems that although I declare myfield with this type
>>
>> <fieldtype name="type1" class="solr.TextField" >
>>
>>      <analyzer>
>>          <tokenizer class="solr.KeywordTokenizerFactory" />
>>
>>          <filter class="solr.LowerCaseFilterFactory" />
>>          <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="0" generateNumberParts="0"
>>              catenateWords="0" catenateNumbers="0" catenateAll="1" splitOnCaseChange="0"
>> />
>>      </analyzer>
>> </fieldtype>
>>
>> Solr is nevertheless tokenizing it, producing several tokens
>> (mag, 778, g, 69)
>>
>> And I can't put double quotes on the query (q="mag 778 G 69") because the
>> request handler searches also in other fields (with different configuration
>> chains)
>>
>> As I understand it, the query parser does whitespace tokenization on its
>> own, at query time, before invoking my (query-time) analysis chain. The same
>> doesn't happen at index time... this is my problem, because at index time
>> the field is analyzed exactly as I want, but unfortunately I cannot say the
>> same at query time.
>>
>> Sorry for my wonderful english, did you get the point?
>>
>>
>> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>>
>>> On a quick scan I don't see a problem here. Attach
>>> &debug=query to your url and that'll show you the
>>> parsed query, which will in turn show you what's been
>>> pushed through the analysis chain you've defined.
>>>
>>> You haven't stated whether you've tried this and it's
>>> not working or you're looking for guidance as to how
>>> to accomplish this so it's a little unclear how to
>>> respond.
>>>
>>> BTW, the admin/analysis page is your friend here....
>>>
>>> Best
>>> Erick
>>>
>>>
>>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
>>> andrea.gazzarini@gmail.com> wrote:
>>>
>>>   Clear, thanks for the response.
>>>> So, if I have two fields
>>>>
>>>> <fieldtype name="type1" class="solr.TextField" >
>>>>       <analyzer>
>>>>           <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>
>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>
>>>> generateWordParts="0" generateNumberParts="0"
>>>>               catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>> splitOnCaseChange="0" />
>>>>       </analyzer>
>>>> </fieldtype>
>>>> <fieldtype name="type2" class="solr.TextField" >
>>>>       <analyzer>
>>>>           <charFilter class="solr.MappingCharFilterFactory"
>>>> mapping="mapping-FoldToASCII.txt"/>
>>>>           <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>           <filter class="solr.WordDelimiterFilterFactory" .../>
>>>>
>>>>       </analyzer>
>>>> </fieldtype>
>>>>
>>>> (with the first field type *Mag. 78 D 99* becomes *mag78d99*, while the
>>>> second field type ends up with several tokens)
>>>>
>>>> And I want to use the same request handler to query against both of them.
>>>> I mean, I want the user to search something like
>>>>
>>>> http//..../search?q=Mag 78 D 99
>>>>
>>>> and this search should search within both the first (with type1) and
>>>> second (with type 2) by matching
>>>>
>>>> - a document which has field_with_type1 equals to *mag78d99* or
>>>> - a document which has field_with_type2 that contains text like "go to
>>>> *mag 78*, class *d* and subclass *99*"
>>>>
>>>>
>>>> <requestHandler ....>
>>>>       ...
>>>>       <str name="defType">dismax</str>
>>>>       ...
>>>>       <str name="mm">100%</str>
>>>>       <str name="qf">
>>>>           field_with_type1
>>>>           field_with_type_2
>>>>       </str>
>>>>       ...
>>>> </requestHandler>
>>>>
>>>> Is that not possible? If not, is it possible to do it in some other way?
>>>>
>>>> Sorry for the long email and thanks again
>>>> Andrea
>>>>
>>>>
>>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>>
>>>>   Quoted phrases will be passed to the analyzer as one string, so a
>>>>> whitespace tokenizer is needed there.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> -----Original Message----- From: Andrea Gazzarini
>>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Tokenization at query time
>>>>>
>>>>> Hi Tanguy,
>>>>> thanks for the fast response. What you are saying corresponds perfectly with
>>>>> the behaviour I'm observing.
>>>>> Now, besides having a big problem (I have several other fields, both
>>>>> in the pf and qf, where spaces don't matter, field types like the
>>>>> "text_en" field type in the example schema), what I'm wondering is:
>>>>>
>>>>> "The query parser splits the input query on white spaces, and then each
>>>>> token is analysed according to your configuration"
>>>>>
>>>>> Is there a valid reason to declare a WhitespaceTokenizer in a query
>>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>>> tokenized) what is its effect?
>>>>>
>>>>> Thank you very much for the help
>>>>> Andrea
>>>>>
>>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>>
>>>>>   Hello Andrea,
>>>>>> I think you face a rather common issue involving keyword tokenization
>>>>>> and query parsing in Lucene:
>>>>>> The query parser splits the input query on white spaces, and then each
>>>>>> token is analysed according to your configuration.
>>>>>> So those queries with a whitespace won't behave as expected because
>>>>>> each
>>>>>> token is analysed separately. Consequently, the catenated version of
>>>>>> the
>>>>>> reference cannot be generated.
>>>>>> I think you could try surrounding your query with double quotes or
>>>>>> escaping the space characters in your query using a backslash so that
>>>>>> the
>>>>>> whole sequence is analysed in the same analyser and the catenation
>>>>>> occurs.
>>>>>> You should be aware that this approach has a drawback: you will
>>>>>> probably
>>>>>> not be able to combine the search for Mag. 778 G 69 with other words in
>>>>>> other fields unless you are able to identify which spaces are to be
>>>>>> escaped:
>>>>>> For example, if the input query is:
>>>>>> Awesome Mag. 778 G 69
>>>>>> you would want to transform it to:
>>>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>>>>> or
>>>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase
>>>>>> query
>>>>>>
>>>>>> Do you get the point?
>>>>>>
>>>>>> Look at the differences between what you tried and the following
>>>>>> examples which should all do what you want:
>>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>>> OR
>>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>>> OR
>>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>>
>>>>>>
>>>>>>
>>>>>> I hope this helps
>>>>>>
>>>>>> Tanguy
>>>>>>
>>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>>
>>>>>>    Hi all,
>>>>>>
>>>>>>> I have a field (among others)in my schema defined like this:
>>>>>>>
>>>>>>> <fieldtype name="mytype" class="solr.TextField"
>>>>>>> positionIncrementGap="100">
>>>>>>>       <analyzer>
>>>>>>>           <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>
>>>>>>>               generateWordParts="0"
>>>>>>>               generateNumberParts="0"
>>>>>>>               catenateWords="0"
>>>>>>>               catenateNumbers="0"
>>>>>>>               catenateAll="1"
>>>>>>>               splitOnCaseChange="0" />
>>>>>>>       </analyzer>
>>>>>>> </fieldtype>
>>>>>>>
>>>>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>>>>
>>>>>>> Basically, both at index and query time the field value is normalized
>>>>>>> like this.
>>>>>>>
>>>>>>> Mag. 778 G 69 => mag778g69
>>>>>>>
>>>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>>>
>>>>>>> <requestHandler ....>
>>>>>>>       ...
>>>>>>>       <str name="defType">dismax</str>
>>>>>>>       ...
>>>>>>>       <str name="mm">100%</str>
>>>>>>>       <str name="qf">myfield^3000</str>
>>>>>>>       <str name="pf">myfield^30000</str>
>>>>>>>
>>>>>>> </requestHandler>
>>>>>>>
>>>>>>> What I'm expecting is that if I index a document with a value for my
>>>>>>> field "Mag. 778 G 69", I will be able to get this document by querying
>>>>>>>
>>>>>>> 1. Mag. 778 G 69
>>>>>>> 2. mag 778 g69
>>>>>>> 3. mag778g69
>>>>>>>
>>>>>>> But that doesn't work: I'm able to get the document if and only if I
>>>>>>> use the "normalized" form: mag778g69
>>>>>>>
>>>>>>> After a little debugging, I see that, even though I used a
>>>>>>> KeywordTokenizer in my field type declaration, Solr is doing
>>>>>>> something like
>>>>>>> this:
>>>>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>>>
>>>>>>>
>>>>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>>>>> 69) and obviously querying the field for separate tokens doesn't match
>>>>>>> anything (at least this is what I think)
>>>>>>>
>>>>>>> Could anybody please explain that to me?
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>> Andrea
>>>>>>>
>>>>>>>


Re: Tokenization at query time

Posted by Erick Erickson <er...@gmail.com>.
I think you can get what you want by escaping the space with a backslash....

YMMV of course.
Erick
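Erick's suggestion can be sketched client-side. The helper below is hypothetical (not part of Solr or SolrJ): it backslash-escapes the spaces in the reference so the Lucene query parser keeps "Mag. 778 G 69" as one token and hands the whole string to the field's analysis chain.

```python
# Hypothetical helper illustrating the backslash-escaping approach.
# Escaping whitespace prevents the query parser from splitting the
# reference before the KeywordTokenizer-based chain sees it.

def escape_reference(reference: str) -> str:
    """Backslash-escape spaces so the query parser treats the value as one token."""
    return reference.replace(" ", "\\ ")

print(escape_reference("Mag. 778 G 69"))  # Mag.\ 778\ G\ 69
```

As Tanguy noted earlier in the thread, this only helps if the client can tell which spaces belong to the reference.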


On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <
andrea.gazzarini@gmail.com> wrote:

> Hi Erick,
> sorry if that wasn't clear: this is what I'm actually observing in my
> application.
>
> I wrote the first post after looking at the explain (debugQuery=true): the
> query
>
> q=mag 778 G 69
>
> is translated as follows:
>
>
>   +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>       DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>       DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>       DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>       DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>
> It seems that although I declare myfield with this type
>
> <fieldtype name="type1" class="solr.TextField" >
>
>     <analyzer>
>         <tokenizer class="solr.KeywordTokenizerFactory" />
>
>         <filter class="solr.LowerCaseFilterFactory" />
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="0"
>             catenateWords="0" catenateNumbers="0" catenateAll="1" splitOnCaseChange="0"
> />
>     </analyzer>
> </fieldtype>
>
> Solr is nevertheless tokenizing it, producing several tokens
> (mag, 778, g, 69)
>
> And I can't put double quotes on the query (q="mag 778 G 69") because the
> request handler searches also in other fields (with different configuration
> chains)
>
> As I understand it, the query parser does whitespace tokenization on its
> own, at query time, before invoking my (query-time) analysis chain. The same
> doesn't happen at index time... this is my problem, because at index time
> the field is analyzed exactly as I want, but unfortunately I cannot say the
> same at query time.
>
> Sorry for my wonderful english, did you get the point?
>
>
> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>
>> On a quick scan I don't see a problem here. Attach
>> &debug=query to your url and that'll show you the
>> parsed query, which will in turn show you what's been
>> pushed through the analysis chain you've defined.
>>
>> You haven't stated whether you've tried this and it's
>> not working or you're looking for guidance as to how
>> to accomplish this so it's a little unclear how to
>> respond.
>>
>> BTW, the admin/analysis page is your friend here....
>>
>> Best
>> Erick
>>
>>
>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
>> andrea.gazzarini@gmail.com> wrote:
>>
>>  Clear, thanks for the response.
>>>
>>> So, if I have two fields
>>>
>>> <fieldtype name="type1" class="solr.TextField" >
>>>      <analyzer>
>>>          <tokenizer class="solr.KeywordTokenizerFactory" />
>>>
>>>          <filter class="solr.LowerCaseFilterFactory" />
>>>          <filter class="solr.WordDelimiterFilterFactory"
>>>
>>> generateWordParts="0" generateNumberParts="0"
>>>              catenateWords="0" catenateNumbers="0" catenateAll="1"
>>> splitOnCaseChange="0" />
>>>      </analyzer>
>>> </fieldtype>
>>> <fieldtype name="type2" class="solr.TextField" >
>>>      <analyzer>
>>>          <charFilter class="solr.MappingCharFilterFactory"
>>> mapping="mapping-FoldToASCII.txt"/>
>>>          <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>          <filter class="solr.LowerCaseFilterFactory" />
>>>          <filter class="solr.WordDelimiterFilterFactory" .../>
>>>
>>>      </analyzer>
>>> </fieldtype>
>>>
>>> (with the first field type *Mag. 78 D 99* becomes *mag78d99*, while the
>>> second field type ends up with several tokens)
>>>
>>> And I want to use the same request handler to query against both of them.
>>> I mean, I want the user to search something like
>>>
>>> http//..../search?q=Mag 78 D 99
>>>
>>> and this search should search within both the first (with type1) and
>>> second (with type 2) by matching
>>>
>>> - a document which has field_with_type1 equals to *mag78d99* or
>>> - a document which has field_with_type2 that contains text like "go to
>>> *mag 78*, class *d* and subclass *99*"
>>>
>>>
>>> <requestHandler ....>
>>>      ...
>>>      <str name="defType">dismax</str>
>>>      ...
>>>      <str name="mm">100%</str>
>>>      <str name="qf">
>>>          field_with_type1
>>>          field_with_type_2
>>>      </str>
>>>      ...
>>> </requestHandler>
>>>
>>> Is that not possible? If not, is it possible to do it in some other way?
>>>
>>> Sorry for the long email and thanks again
>>> Andrea
>>>
>>>
>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>
>>>  Quoted phrases will be passed to the analyzer as one string, so a
>>>> whitespace tokenizer is needed there.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message----- From: Andrea Gazzarini
>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Tokenization at query time
>>>>
>>>> Hi Tanguy,
>>>> thanks for the fast response. What you are saying corresponds perfectly with
>>>> the behaviour I'm observing.
>>>> Now, besides having a big problem (I have several other fields, both
>>>> in the pf and qf, where spaces don't matter, field types like the
>>>> "text_en" field type in the example schema), what I'm wondering is:
>>>>
>>>> "The query parser splits the input query on white spaces, and then each
>>>> token is analysed according to your configuration"
>>>>
>>>> Is there a valid reason to declare a WhitespaceTokenizer in a query
>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>> tokenized) what is its effect?
>>>>
>>>> Thank you very much for the help
>>>> Andrea
>>>>
>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>
>>>>  Hello Andrea,
>>>>> I think you face a rather common issue involving keyword tokenization
>>>>> and query parsing in Lucene:
>>>>> The query parser splits the input query on white spaces, and then each
>>>>> token is analysed according to your configuration.
>>>>> So those queries with a whitespace won't behave as expected because
>>>>> each
>>>>> token is analysed separately. Consequently, the catenated version of
>>>>> the
>>>>> reference cannot be generated.
>>>>> I think you could try surrounding your query with double quotes or
>>>>> escaping the space characters in your query using a backslash so that
>>>>> the
>>>>> whole sequence is analysed in the same analyser and the catenation
>>>>> occurs.
>>>>> You should be aware that this approach has a drawback: you will
>>>>> probably
>>>>> not be able to combine the search for Mag. 778 G 69 with other words in
>>>>> other fields unless you are able to identify which spaces are to be
>>>>> escaped:
>>>>> For example, if the input query is:
>>>>> Awesome Mag. 778 G 69
>>>>> you would want to transform it to:
>>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>>>> or
>>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase
>>>>> query
>>>>>
>>>>> Do you get the point?
>>>>>
>>>>> Look at the differences between what you tried and the following
>>>>> examples which should all do what you want:
>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>> OR
>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>> OR
>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>
>>>>>
>>>>>
>>>>> I hope this helps
>>>>>
>>>>> Tanguy
>>>>>
>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>
>>>>>   Hi all,
>>>>>
>>>>>> I have a field (among others)in my schema defined like this:
>>>>>>
>>>>>> <fieldtype name="mytype" class="solr.TextField"
>>>>>> positionIncrementGap="100">
>>>>>>      <analyzer>
>>>>>>          <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>          <filter class="solr.LowerCaseFilterFactory" />
>>>>>>          <filter class="solr.WordDelimiterFilterFactory"
>>>>>>
>>>>>>              generateWordParts="0"
>>>>>>              generateNumberParts="0"
>>>>>>              catenateWords="0"
>>>>>>              catenateNumbers="0"
>>>>>>              catenateAll="1"
>>>>>>              splitOnCaseChange="0" />
>>>>>>      </analyzer>
>>>>>> </fieldtype>
>>>>>>
>>>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>>>
>>>>>> Basically, both at index and query time the field value is normalized
>>>>>> like this.
>>>>>>
>>>>>> Mag. 778 G 69 => mag778g69
>>>>>>
>>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>>
>>>>>> <requestHandler ....>
>>>>>>      ...
>>>>>>      <str name="defType">dismax</str>
>>>>>>      ...
>>>>>>      <str name="mm">100%</str>
>>>>>>      <str name="qf">myfield^3000</str>
>>>>>>      <str name="pf">myfield^30000</str>
>>>>>>
>>>>>> </requestHandler>
>>>>>>
>>>>>> What I'm expecting is that if I index a document with a value for my
>>>>>> field "Mag. 778 G 69", I will be able to get this document by querying
>>>>>>
>>>>>> 1. Mag. 778 G 69
>>>>>> 2. mag 778 g69
>>>>>> 3. mag778g69
>>>>>>
>>>>>> But that doesn't work: I'm able to get the document if and only if I
>>>>>> use the "normalized" form: mag778g69
>>>>>>
>>>>>> After a little debugging, I see that, even though I used a
>>>>>> KeywordTokenizer in my field type declaration, Solr is doing
>>>>>> something like
>>>>>> this:
>>>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>>
>>>>>>
>>>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>>>> 69) and obviously querying the field for separate tokens doesn't match
>>>>>> anything (at least this is what I think)
>>>>>>
>>>>>> Could anybody please explain that to me?
>>>>>>
>>>>>> Thanks in advance
>>>>>> Andrea
>>>>>>
>>>>>>
>>>>>
>

Re: Tokenization at query time

Posted by Andrea Gazzarini <an...@gmail.com>.
Hi Erick,
sorry if that wasn't clear: this is what I'm actually observing in my 
application.

I wrote the first post after looking at the explain (debugQuery=true): 
the query

q=mag 778 G 69

is translated as follows:

  +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
       DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
       DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
       DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
       DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)

It seems that although I declare myfield with this type

<fieldtype name="type1" class="solr.TextField" >
     <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory" />

         <filter class="solr.LowerCaseFilterFactory" />
         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0"
             catenateWords="0" catenateNumbers="0" catenateAll="1" splitOnCaseChange="0" />
     </analyzer>
</fieldtype>

Solr is nevertheless tokenizing it, producing several tokens (mag, 778, g, 69)

And I can't put double quotes on the query (q="mag 778 G 69") because 
the request handler searches also in other fields (with different 
configuration chains)

As I understand it, the query parser does whitespace tokenization on its
own, at query time, before invoking my (query-time) analysis chain. The same
doesn't happen at index time... this is my problem, because at index
time the field is analyzed exactly as I want, but unfortunately I cannot
say the same at query time.

Sorry for my wonderful english, did you get the point?
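The mismatch described above can be simulated with a simplified stand-in for the analysis chain (lowercase and strip non-alphanumerics, roughly what KeywordTokenizer + LowerCaseFilter + WordDelimiterFilter with catenateAll="1" produce). This is an illustration only, not real Lucene code:

```python
import re

def analyze(value: str) -> str:
    # Stand-in for the index-time chain: one lowercased, catenated token.
    return re.sub(r"[^0-9a-z]", "", value.lower())

# Index time: the whole field value goes through the chain as one string.
indexed = analyze("Mag. 778 G 69")

# Query time: the query parser splits on whitespace *before* the chain runs,
# so each fragment is analyzed separately and the catenated form never appears.
queried = [analyze(t) for t in "Mag. 778 G 69".split()]

print(indexed)  # mag778g69
print(queried)  # ['mag', '778', 'g', '69']
```

None of the query-time tokens equals the indexed token, which is why only the already-normalized form matches.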

On 08/13/2013 02:18 PM, Erick Erickson wrote:
> On a quick scan I don't see a problem here. Attach
> &debug=query to your url and that'll show you the
> parsed query, which will in turn show you what's been
> pushed through the analysis chain you've defined.
>
> You haven't stated whether you've tried this and it's
> not working or you're looking for guidance as to how
> to accomplish this so it's a little unclear how to
> respond.
>
> BTW, the admin/analysis page is your friend here....
>
> Best
> Erick
>
>
> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
> andrea.gazzarini@gmail.com> wrote:
>
>> Clear, thanks for the response.
>>
>> So, if I have two fields
>>
>> <fieldtype name="type1" class="solr.TextField" >
>>      <analyzer>
>>          <tokenizer class="solr.KeywordTokenizerFactory" />
>>
>>          <filter class="solr.LowerCaseFilterFactory" />
>>          <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="0" generateNumberParts="0"
>>              catenateWords="0" catenateNumbers="0" catenateAll="1"
>> splitOnCaseChange="0" />
>>      </analyzer>
>> </fieldtype>
>> <fieldtype name="type2" class="solr.TextField" >
>>      <analyzer>
>>          <charFilter class="solr.MappingCharFilterFactory"
>> mapping="mapping-FoldToASCII.txt"/>
>>          <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>          <filter class="solr.LowerCaseFilterFactory" />
>>          <filter class="solr.WordDelimiterFilterFactory" .../>
>>      </analyzer>
>> </fieldtype>
>>
>> (with the first field type *Mag. 78 D 99* becomes *mag78d99*, while the
>> second field type ends up with several tokens)
>>
>> And I want to use the same request handler to query against both of them.
>> I mean, I want the user to search something like
>>
>> http//..../search?q=Mag 78 D 99
>>
>> and this search should search within both the first (with type1) and
>> second (with type 2) by matching
>>
>> - a document which has field_with_type1 equals to *mag78d99* or
>> - a document which has field_with_type2 that contains text like "go to
>> *mag 78*, class *d* and subclass *99*"
>>
>>
>> <requestHandler ....>
>>      ...
>>      <str name="defType">dismax</str>
>>      ...
>>      <str name="mm">100%</str>
>>      <str name="qf">
>>          field_with_type1
>>          field_with_type_2
>>      </str>
>>      ...
>> </requestHandler>
>>
>> Is that not possible? If not, is it possible to do it in some other way?
>>
>> Sorry for the long email and thanks again
>> Andrea
>>
>>
>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>
>>> Quoted phrases will be passed to the analyzer as one string, so a
>>> whitespace tokenizer is needed there.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Andrea Gazzarini
>>> Sent: Monday, August 12, 2013 6:52 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Tokenization at query time
>>>
>>> Hi Tanguy,
>>> thanks for the fast response. What you are saying corresponds perfectly with
>>> the behaviour I'm observing.
>>> Now, besides having a big problem (I have several other fields, both
>>> in the pf and qf, where spaces don't matter, field types like the
>>> "text_en" field type in the example schema), what I'm wondering is:
>>>
>>> "The query parser splits the input query on white spaces, and then each
>>> token is analysed according to your configuration"
>>>
>>> Is there a valid reason to declare a WhitespaceTokenizer in a query
>>> analyzer? If the input query is already parsed (i.e. whitespace
>>> tokenized) what is its effect?
>>>
>>> Thank you very much for the help
>>> Andrea
>>>
>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>
>>>> Hello Andrea,
>>>> I think you face a rather common issue involving keyword tokenization
>>>> and query parsing in Lucene:
>>>> The query parser splits the input query on white spaces, and then each
>>>> token is analysed according to your configuration.
>>>> So those queries with a whitespace won't behave as expected because each
>>>> token is analysed separately. Consequently, the catenated version of the
>>>> reference cannot be generated.
>>>> I think you could try surrounding your query with double quotes or
>>>> escaping the space characters in your query using a backslash so that the
>>>> whole sequence is analysed in the same analyser and the catenation occurs.
>>>> You should be aware that this approach has a drawback: you will probably
>>>> not be able to combine the search for Mag. 778 G 69 with other words in
>>>> other fields unless you are able to identify which spaces are to be escaped:
>>>> For example, if the input query is:
>>>> Awesome Mag. 778 G 69
>>>> you would want to transform it to:
>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>>> or
>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase
>>>> query
>>>>
>>>> Do you get the point?
>>>>
>>>> Look at the differences between what you tried and the following
>>>> examples which should all do what you want:
>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>> OR
>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>> OR
>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>
>>>>
>>>> I hope this helps
>>>>
>>>> Tanguy
>>>>
>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>> andrea.gazzarini@gmail.com> wrote:
>>>>
>>>>   Hi all,
>>>>> I have a field (among others)in my schema defined like this:
>>>>>
>>>>> <fieldtype name="mytype" class="solr.TextField"
>>>>> positionIncrementGap="100">
>>>>>      <analyzer>
>>>>>          <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>          <filter class="solr.LowerCaseFilterFactory" />
>>>>>          <filter class="solr.WordDelimiterFilterFactory"
>>>>>              generateWordParts="0"
>>>>>              generateNumberParts="0"
>>>>>              catenateWords="0"
>>>>>              catenateNumbers="0"
>>>>>              catenateAll="1"
>>>>>              splitOnCaseChange="0" />
>>>>>      </analyzer>
>>>>> </fieldtype>
>>>>>
>>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>>
>>>>> Basically, both at index and query time the field value is normalized
>>>>> like this.
>>>>>
>>>>> Mag. 778 G 69 => mag778g69
>>>>>
>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>
>>>>> <requestHandler ....>
>>>>>      ...
>>>>>      <str name="defType">dismax</str>
>>>>>      ...
>>>>>      <str name="mm">100%</str>
>>>>>      <str name="qf">myfield^3000</str>
>>>>>      <str name="pf">myfield^30000</str>
>>>>>
>>>>> </requestHandler>
>>>>>
>>>>> What I'm expecting is that if I index a document with a value for my
>>>>> field "Mag. 778 G 69", I will be able to get this document by querying
>>>>>
>>>>> 1. Mag. 778 G 69
>>>>> 2. mag 778 g69
>>>>> 3. mag778g69
>>>>>
>>>>> But that doesn't work: I'm able to get the document if and only if I
>>>>> use the "normalized" form: mag778g69
>>>>>
>>>>> After a little debugging, I see that, even though I used a
>>>>> KeywordTokenizer in my field type declaration, Solr is doing something like
>>>>> this:
>>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>
>>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>>> 69) and obviously querying the field for separate tokens doesn't match
>>>>> anything (at least this is what I think)
>>>>>
>>>>> Does anybody could please explain me that?
>>>>>
>>>>> Thanks in advance
>>>>> Andrea
>>>>>
>>>>


Re: Tokenization at query time

Posted by Erick Erickson <er...@gmail.com>.
On a quick scan I don't see a problem here. Attach
&debug=query to your url and that'll show you the
parsed query, which will in turn show you what's been
pushed through the analysis chain you've defined.
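For instance, here is a rough sketch (plain Python; the core name and handler path are placeholders, adjust them to your setup) of building such a debug URL:

```python
from urllib.parse import urlencode

# Build a select URL with debug=query so the response includes the parsed query.
params = {
    "q": "Mag. 778 G 69",
    "defType": "dismax",
    "qf": "myfield^3000",
    "debug": "query",
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print(url)
```

The parsed-query section of the debug output then shows exactly what each analyzer produced.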

You haven't stated whether you've tried this and it's
not working or you're looking for guidance as to how
to accomplish this, so it's a little unclear how to
respond.

BTW, the admin/analysis page is your friend here....

Best
Erick


On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
andrea.gazzarini@gmail.com> wrote:

> Clear, thanks for response.
>
> So, if I have two fields
>
> <fieldtype name="type1" class="solr.TextField" >
>     <analyzer>
>         <tokenizer class="solr.KeywordTokenizerFactory" />
>
>         <filter class="solr.LowerCaseFilterFactory" />
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="0"
>             catenateWords="0" catenateNumbers="0" catenateAll="1"
> splitOnCaseChange="0" />
>     </analyzer>
> </fieldtype>
> <fieldtype name="type2" class="solr.TextField" >
>     <analyzer>
>         <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-FoldToASCII.txt"/>
>         <tokenizer class="solr.WhitespaceTokenizerFactory" />
>         <filter class="solr.LowerCaseFilterFactory" />
>         <filter class="solr.WordDelimiterFilterFactory" .../>
>     </analyzer>
> </fieldtype>
>
> (with the first field type, *Mag. 78 D 99* becomes *mag78d99*, while the
> second field type ends up with several tokens)
>
> And I want to use the same request handler to query against both of them.
> I mean I want the user to search for something like
>
> http//..../search?q=Mag 78 D 99
>
> and this search should look in both the first field (type1) and the
> second field (type2), matching
>
> - a document which has field_with_type1 equals to *mag78d99* or
> - a document which has field_with_type2 that contains a text like "go to
> *mag 78*, class *d* and subclass *99*")
>
>
> <requestHandler ....>
>     ...
>     <str name="defType">dismax</str>
>     ...
>     <str name="mm">100%</str>
>     <str name="qf">
>         field_with_type1
>         field_with_type2
>     </str>
>     ...
> </requestHandler>
>
> Is this not possible? If so, is it possible to do it in some other way?
>
> Sorry for the long email and thanks again
> Andrea
>
>
> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>
>> Quoted phrases will be passed to the analyzer as one string, so that's
>> where a whitespace tokenizer is needed.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Andrea Gazzarini
>> Sent: Monday, August 12, 2013 6:52 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Tokenization at query time
>>
>> Hi Tanguy,
>> thanks for the fast response. What you are saying corresponds perfectly with
>> the behaviour I'm observing.
>> Now, besides having a big problem (I have several other fields in both
>> the pf and qf where spaces don't matter, field types like the
>> "text_en" field type in the example schema), what I'm wondering is:
>>
>> "The query parser splits the input query on white spaces, and then each
>> token is analysed according to your configuration"
>> Is there a valid reason to declare a WhiteSpaceTokenizer in a query
>> analyzer? If the input query is already parsed (i.e. whitespace
>> tokenized) what is its effect?
>>
>> Thank you very much for the help
>> Andrea
>>
>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>
>>> Hello Andrea,
>>> I think you face a rather common issue involving keyword tokenization
>>> and query parsing in Lucene:
>>> The query parser splits the input query on white spaces, and then each
>>> token is analysed according to your configuration.
>>> So those queries with a whitespace won't behave as expected because each
>>> token is analysed separately. Consequently, the catenated version of the
>>> reference cannot be generated.
>>> I think you could try surrounding your query with double quotes or
>>> escaping the space characters in your query using a backslash so that the
>>> whole sequence is analysed in the same analyser and the catenation occurs.
>>> You should be aware that this approach has a drawback: you will probably
>>> not be able to combine the search for Mag. 778 G 69 with other words in
>>> other fields unless you are able to identify which spaces are to be escaped:
>>> For example, if the input query is:
>>> Awesome Mag. 778 G 69
>>> you would want to transform it to:
>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>> or
>>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase
>>> query
>>>
>>> Do you get the point?
>>>
>>> Look at the differences between what you tried and the following
>>> examples which should all do what you want:
>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>> OR
>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>> OR
>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>
>>>
>>> I hope this helps
>>>
>>> Tanguy
>>>
>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>> andrea.gazzarini@gmail.com> wrote:
>>>
>>>  Hi all,
>>>> I have a field (among others) in my schema defined like this:
>>>>
>>>> <fieldtype name="mytype" class="solr.TextField"
>>>> positionIncrementGap="100">
>>>>     <analyzer>
>>>>         <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>         <filter class="solr.LowerCaseFilterFactory" />
>>>>         <filter class="solr.WordDelimiterFilterFactory"
>>>>             generateWordParts="0"
>>>>             generateNumberParts="0"
>>>>             catenateWords="0"
>>>>             catenateNumbers="0"
>>>>             catenateAll="1"
>>>>             splitOnCaseChange="0" />
>>>>     </analyzer>
>>>> </fieldtype>
>>>>
>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>
>>>> Basically, both at index and query time the field value is normalized
>>>> like this.
>>>>
>>>> Mag. 778 G 69 => mag778g69
>>>>
>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>
>>>> <requestHandler ....>
>>>>     ...
>>>>     <str name="defType">dismax</str>
>>>>     ...
>>>>     <str name="mm">100%</str>
>>>>     <str name="qf">myfield^3000</str>
>>>>     <str name="pf">myfield^30000</str>
>>>>
>>>> </requestHandler>
>>>>
>>>> What I'm expecting is that if I index a document with a value for my
>>>> field "Mag. 778 G 69", I will be able to get this document by querying
>>>>
>>>> 1. Mag. 778 G 69
>>>> 2. mag 778 g69
>>>> 3. mag778g69
>>>>
>>>> But that doesn't work: I'm able to get the document if and only if I
>>>> use the "normalized" form: mag778g69
>>>>
>>>> After doing a little bit of debugging, I see that, even though I used a
>>>> KeywordTokenizer in my field type declaration, Solr is doing something
>>>> like this:
>>>>
>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>
>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>> 69), and obviously querying the field for the separate tokens doesn't
>>>> match anything (at least, that is what I think).
>>>>
>>>> Could anybody please explain this to me?
>>>>
>>>> Thanks in advance
>>>> Andrea
>>>>
>>>
>>>
>>
>

Re: Tokenization at query time

Posted by Andrea Gazzarini <an...@gmail.com>.
Clear, thanks for response.

So, if I have two fields

<fieldtype name="type1" class="solr.TextField" >
     <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory" />
         <filter class="solr.LowerCaseFilterFactory" />
         <filter class="solr.WordDelimiterFilterFactory" 
generateWordParts="0" generateNumberParts="0"
             catenateWords="0" catenateNumbers="0" catenateAll="1" 
splitOnCaseChange="0" />
     </analyzer>
</fieldtype>
<fieldtype name="type2" class="solr.TextField" >
     <analyzer>
         <charFilter class="solr.MappingCharFilterFactory" 
mapping="mapping-FoldToASCII.txt"/>
         <tokenizer class="solr.WhitespaceTokenizerFactory" />
         <filter class="solr.LowerCaseFilterFactory" />
         <filter class="solr.WordDelimiterFilterFactory" .../>
     </analyzer>
</fieldtype>

(with the first field type, *Mag. 78 D 99* becomes *mag78d99*, while the
second field type ends up with several tokens)

And I want to use the same request handler to query against both of 
them. I mean I want the user to search for something like

http//..../search?q=Mag 78 D 99

and this search should look in both the first field (type1) and the
second field (type2), matching

- a document which has field_with_type1 equals to *mag78d99* or
- a document which has field_with_type2 that contains a text like "go to 
*mag 78*, class *d* and subclass *99*")
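If it helps, here is a rough Python sketch of what I understand the two analysis chains produce for the same input (the function names and the regex are mine, and the regex is a simplification of the real filters):

```python
import re

def analyze_type1(text):
    # KeywordTokenizer keeps the whole value as one token; LowerCaseFilter
    # lowercases it, and WordDelimiterFilter with catenateAll="1" glues the
    # alphanumeric parts back together into a single term.
    return "".join(re.findall(r"[a-z0-9]+", text.lower()))

def analyze_type2(text):
    # WhitespaceTokenizer + LowerCaseFilter: one lowercased token per
    # whitespace-separated word (punctuation stripped here for simplicity).
    return re.findall(r"[a-z0-9]+", text.lower())

print(analyze_type1("Mag. 78 D 99"))  # mag78d99
print(analyze_type2("Mag. 78 D 99"))  # ['mag', '78', 'd', '99']
```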

<requestHandler ....>
     ...
     <str name="defType">dismax</str>
     ...
     <str name="mm">100%</str>
     <str name="qf">
         field_with_type1
         field_with_type2
     </str>
     ...
</requestHandler>

Is this not possible? If so, is it possible to do it in some other way?

Sorry for the long email and thanks again
Andrea

On 08/12/2013 04:01 PM, Jack Krupansky wrote:
> Quoted phrases will be passed to the analyzer as one string, so that's
> where a whitespace tokenizer is needed.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Andrea Gazzarini
> Sent: Monday, August 12, 2013 6:52 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Tokenization at query time
>
> Hi Tanguy,
> thanks for the fast response. What you are saying corresponds perfectly with
> the behaviour I'm observing.
> Now, besides having a big problem (I have several other fields in both
> the pf and qf where spaces don't matter, field types like the
> "text_en" field type in the example schema), what I'm wondering is:
>
> "The query parser splits the input query on white spaces, and then each
> token is analysed according to your configuration"
> Is there a valid reason to declare a WhiteSpaceTokenizer in a query
> analyzer? If the input query is already parsed (i.e. whitespace
> tokenized) what is its effect?
>
> Thank you very much for the help
> Andrea
>
> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>> Hello Andrea,
>> I think you face a rather common issue involving keyword tokenization 
>> and query parsing in Lucene:
>> The query parser splits the input query on white spaces, and then 
>> each token is analysed according to your configuration.
>> So those queries with a whitespace won't behave as expected because 
>> each token is analysed separately. Consequently, the catenated 
>> version of the reference cannot be generated.
>> I think you could try surrounding your query with double quotes or 
>> escaping the space characters in your query using a backslash so that 
>> the whole sequence is analysed in the same analyser and the 
>> catenation occurs.
>> You should be aware that this approach has a drawback: you will 
>> probably not be able to combine the search for Mag. 778 G 69 with 
>> other words in other fields unless you are able to identify which 
>> spaces are to be escaped:
>> For example, if the input query is:
>> Awesome Mag. 778 G 69
>> you would want to transform it to:
>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>> or
>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase 
>> query
>>
>> Do you get the point?
>>
>> Look at the differences between what you tried and the following 
>> examples which should all do what you want:
>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax 
>>
>> OR
>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on 
>>
>> OR
>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax 
>>
>>
>> I hope this helps
>>
>> Tanguy
>>
>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini 
>> <an...@gmail.com> wrote:
>>
>>> Hi all,
>>> I have a field (among others) in my schema defined like this:
>>>
>>> <fieldtype name="mytype" class="solr.TextField" 
>>> positionIncrementGap="100">
>>>     <analyzer>
>>>         <tokenizer class="solr.KeywordTokenizerFactory" />
>>>         <filter class="solr.LowerCaseFilterFactory" />
>>>         <filter class="solr.WordDelimiterFilterFactory"
>>>             generateWordParts="0"
>>>             generateNumberParts="0"
>>>             catenateWords="0"
>>>             catenateNumbers="0"
>>>             catenateAll="1"
>>>             splitOnCaseChange="0" />
>>>     </analyzer>
>>> </fieldtype>
>>>
>>> <field name="myfield" type="mytype" indexed="true"/>
>>>
>>> Basically, both at index and query time the field value is 
>>> normalized like this.
>>>
>>> Mag. 778 G 69 => mag778g69
>>>
>>> Now, in my solrconfig I'm using a search handler like this:
>>> <requestHandler ....>
>>>     ...
>>>     <str name="defType">dismax</str>
>>>     ...
>>>     <str name="mm">100%</str>
>>>     <str name="qf">myfield^3000</str>
>>>     <str name="pf">myfield^30000</str>
>>>
>>> </requestHandler>
>>>
>>> What I'm expecting is that if I index a document with a value for my 
>>> field "Mag. 778 G 69", I will be able to get this document by querying
>>>
>>> 1. Mag. 778 G 69
>>> 2. mag 778 g69
>>> 3. mag778g69
>>>
>>> But that doesn't work: I'm able to get the document if and only if
>>> I use the "normalized" form: mag778g69
>>>
>>> After doing a little bit of debugging, I see that, even though I used a
>>> KeywordTokenizer in my field type declaration, Solr is doing
>>> something like this:
>>>
>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>
>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>> 69), and obviously querying the field for the separate tokens doesn't
>>> match anything (at least, that is what I think).
>>>
>>> Could anybody please explain this to me?
>>>
>>> Thanks in advance
>>> Andrea
>>
>


Re: Tokenization at query time

Posted by Jack Krupansky <ja...@basetechnology.com>.
Quoted phrases will be passed to the analyzer as one string, so that's
where a whitespace tokenizer is needed.
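For example, from the client side (a tiny Python sketch; the helper name is mine, just to illustrate the quoting):

```python
# Wrap the catalogue reference in double quotes so the query parser hands the
# whole phrase to the field analyzer in one piece, instead of word by word.
def quote_reference(ref):
    return '"%s"' % ref

q = "Awesome " + quote_reference("Mag. 778 G 69")
print(q)  # Awesome "Mag. 778 G 69"
```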

-- Jack Krupansky

-----Original Message----- 
From: Andrea Gazzarini
Sent: Monday, August 12, 2013 6:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Tokenization at query time

Hi Tanguy,
thanks for the fast response. What you are saying corresponds perfectly with
the behaviour I'm observing.
Now, besides having a big problem (I have several other fields in both
the pf and qf where spaces don't matter, field types like the
"text_en" field type in the example schema), what I'm wondering is:

"The query parser splits the input query on white spaces, and then each
token is analysed according to your configuration"
Is there a valid reason to declare a WhiteSpaceTokenizer in a query
analyzer? If the input query is already parsed (i.e. whitespace
tokenized) what is its effect?

Thank you very much for the help
Andrea

On 08/12/2013 12:37 PM, Tanguy Moal wrote:
> Hello Andrea,
> I think you face a rather common issue involving keyword tokenization and 
> query parsing in Lucene:
> The query parser splits the input query on white spaces, and then each 
> token is analysed according to your configuration.
> So those queries with a whitespace won't behave as expected because each 
> token is analysed separately. Consequently, the catenated version of the 
> reference cannot be generated.
> I think you could try surrounding your query with double quotes or 
> escaping the space characters in your query using a backslash so that the 
> whole sequence is analysed in the same analyser and the catenation occurs.
> You should be aware that this approach has a drawback: you will probably 
> not be able to combine the search for Mag. 778 G 69 with other words in 
> other fields unless you are able to identify which spaces are to be 
> escaped:
> For example, if the input query is:
> Awesome Mag. 778 G 69
> you would want to transform it to:
> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
> or
> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase 
> query
>
> Do you get the point?
>
> Look at the differences between what you tried and the following examples 
> which should all do what you want:
> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
> OR
> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
> OR
> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>
> I hope this helps
>
> Tanguy
>
> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini 
> <an...@gmail.com> wrote:
>
>> Hi all,
>> I have a field (among others) in my schema defined like this:
>>
>> <fieldtype name="mytype" class="solr.TextField" 
>> positionIncrementGap="100">
>>     <analyzer>
>>         <tokenizer class="solr.KeywordTokenizerFactory" />
>>         <filter class="solr.LowerCaseFilterFactory" />
>>         <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="0"
>>             generateNumberParts="0"
>>             catenateWords="0"
>>             catenateNumbers="0"
>>             catenateAll="1"
>>             splitOnCaseChange="0" />
>>     </analyzer>
>> </fieldtype>
>>
>> <field name="myfield" type="mytype" indexed="true"/>
>>
>> Basically, both at index and query time the field value is normalized 
>> like this.
>>
>> Mag. 778 G 69 => mag778g69
>>
>> Now, in my solrconfig I'm using a search handler like this:
>>
>> <requestHandler ....>
>>     ...
>>     <str name="defType">dismax</str>
>>     ...
>>     <str name="mm">100%</str>
>>     <str name="qf">myfield^3000</str>
>>     <str name="pf">myfield^30000</str>
>>
>> </requestHandler>
>>
>> What I'm expecting is that if I index a document with a value for my 
>> field "Mag. 778 G 69", I will be able to get this document by querying
>>
>> 1. Mag. 778 G 69
>> 2. mag 778 g69
>> 3. mag778g69
>>
>> But that doesn't work: I'm able to get the document if and only if I
>> use the "normalized" form: mag778g69
>>
>> After doing a little bit of debugging, I see that, even though I used a
>> KeywordTokenizer in my field type declaration, Solr is doing something
>> like this:
>>
>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>
>> That is, it is tokenizing the original query string (mag + 778 + g + 69),
>> and obviously querying the field for the separate tokens doesn't match
>> anything (at least, that is what I think).
>>
>> Could anybody please explain this to me?
>>
>> Thanks in advance
>> Andrea
>


Re: Tokenization at query time

Posted by Andrea Gazzarini <an...@gmail.com>.
Hi Tanguy,
thanks for the fast response. What you are saying corresponds perfectly with
the behaviour I'm observing.
Now, besides having a big problem (I have several other fields in both
the pf and qf where spaces don't matter, field types like the
"text_en" field type in the example schema), what I'm wondering is:

"The query parser splits the input query on white spaces, and then each
token is analysed according to your configuration"
Is there a valid reason to declare a WhiteSpaceTokenizer in a query 
analyzer? If the input query is already parsed (i.e. whitespace 
tokenized) what is its effect?

Thank you very much for the help
Andrea

On 08/12/2013 12:37 PM, Tanguy Moal wrote:
> Hello Andrea,
> I think you face a rather common issue involving keyword tokenization and query parsing in Lucene:
> The query parser splits the input query on white spaces, and then each token is analysed according to your configuration.
> So those queries with a whitespace won't behave as expected because each token is analysed separately. Consequently, the catenated version of the reference cannot be generated.
> I think you could try surrounding your query with double quotes or escaping the space characters in your query using a backslash so that the whole sequence is analysed in the same analyser and the catenation occurs.
> You should be aware that this approach has a drawback: you will probably not be able to combine the search for Mag. 778 G 69 with other words in other fields unless you are able to identify which spaces are to be escaped:
> For example, if the input query is:
> Awesome Mag. 778 G 69
> you would want to transform it to:
> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
> or
> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase query
>
> Do you get the point?
>
> Look at the differences between what you tried and the following examples which should all do what you want:
> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
> OR
> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
> OR
> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>
> I hope this helps
>
> Tanguy
>
> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <an...@gmail.com> wrote:
>
>> Hi all,
>> I have a field (among others) in my schema defined like this:
>>
>> <fieldtype name="mytype" class="solr.TextField" positionIncrementGap="100">
>>     <analyzer>
>>         <tokenizer class="solr.KeywordTokenizerFactory" />
>>         <filter class="solr.LowerCaseFilterFactory" />
>>         <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="0"
>>             generateNumberParts="0"
>>             catenateWords="0"
>>             catenateNumbers="0"
>>             catenateAll="1"
>>             splitOnCaseChange="0" />
>>     </analyzer>
>> </fieldtype>
>>
>> <field name="myfield" type="mytype" indexed="true"/>
>>
>> Basically, both at index and query time the field value is normalized like this.
>>
>> Mag. 778 G 69 => mag778g69
>>
>> Now, in my solrconfig I'm using a search handler like this:
>>
>> <requestHandler ....>
>>     ...
>>     <str name="defType">dismax</str>
>>     ...
>>     <str name="mm">100%</str>
>>     <str name="qf">myfield^3000</str>
>>     <str name="pf">myfield^30000</str>
>>
>> </requestHandler>
>>
>> What I'm expecting is that if I index a document with a value for my field "Mag. 778 G 69", I will be able to get this document by querying
>>
>> 1. Mag. 778 G 69
>> 2. mag 778 g69
>> 3. mag778g69
>>
>> But that doesn't work: I'm able to get the document if and only if I use the "normalized" form: mag778g69
>>
>> After doing a little bit of debugging, I see that, even though I used a KeywordTokenizer in my field type declaration, Solr is doing something like this:
>>
>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1) DisjunctionMaxQuery((myfield:778^3000.0)~0.1) DisjunctionMaxQuery((myfield:g^3000.0)~0.1) DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4) DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>
>> That is, it is tokenizing the original query string (mag + 778 + g + 69), and obviously querying the field for the separate tokens doesn't match anything (at least, that is what I think).
>>
>> Could anybody please explain this to me?
>>
>> Thanks in advance
>> Andrea
>


Re: Tokenization at query time

Posted by Tanguy Moal <ta...@gmail.com>.
Hello Andrea,
I think you face a rather common issue involving keyword tokenization and query parsing in Lucene:
The query parser splits the input query on white spaces, and then each token is analysed according to your configuration.
So those queries with a whitespace won't behave as expected because each token is analysed separately. Consequently, the catenated version of the reference cannot be generated.
I think you could try surrounding your query with double quotes or escaping the space characters in your query using a backslash so that the whole sequence is analysed in the same analyser and the catenation occurs.
You should be aware that this approach has a drawback: you will probably not be able to combine the search for Mag. 778 G 69 with other words in other fields unless you are able to identify which spaces are to be escaped:
For example, if the input query is:
Awesome Mag. 778 G 69
you would want to transform it to:
Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
or
Awesome "Mag. 778 G 69" // only the reference is turned into a phrase query
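A tiny sketch of the escaping variant (plain Python; it assumes you can already tell which part of the input is the reference, and the helper name is mine):

```python
# Escape the spaces inside the known reference with backslashes so the whole
# sequence reaches the field analyzer as a single token.
def escape_spaces(ref):
    return ref.replace(" ", "\\ ")

print("Awesome " + escape_spaces("Mag. 778 G 69"))  # Awesome Mag.\ 778\ G\ 69
```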

Do you get the point?

Look at the differences between what you tried and the following examples which should all do what you want:
http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
OR
http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
OR
http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax

I hope this helps

Tanguy

On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <an...@gmail.com> wrote:

> Hi all,
> I have a field (among others) in my schema defined like this:
> 
> <fieldtype name="mytype" class="solr.TextField" positionIncrementGap="100">
>    <analyzer>
>        <tokenizer class="solr.KeywordTokenizerFactory" />
>        <filter class="solr.LowerCaseFilterFactory" />
>        <filter class="solr.WordDelimiterFilterFactory"
>            generateWordParts="0"
>            generateNumberParts="0"
>            catenateWords="0"
>            catenateNumbers="0"
>            catenateAll="1"
>            splitOnCaseChange="0" />
>    </analyzer>
> </fieldtype>
> 
> <field name="myfield" type="mytype" indexed="true"/>
> 
> Basically, both at index and query time the field value is normalized like this.
> 
> Mag. 778 G 69 => mag778g69
> 
> Now, in my solrconfig I'm using a search handler like this:
> 
> <requestHandler ....>
>    ...
>    <str name="defType">dismax</str>
>    ...
>    <str name="mm">100%</str>
>    <str name="qf">myfield^3000</str>
>    <str name="pf">myfield^30000</str>
> 
> </requestHandler>
> 
> What I'm expecting is that if I index a document with a value for my field "Mag. 778 G 69", I will be able to get this document by querying
> 
> 1. Mag. 778 G 69
> 2. mag 778 g69
> 3. mag778g69
> 
> But that doesn't work: I'm able to get the document if and only if I use the "normalized" form: mag778g69
> 
> After doing a little bit of debugging, I see that, even though I used a KeywordTokenizer in my field type declaration, Solr is doing something like this:
> 
> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1) DisjunctionMaxQuery((myfield:778^3000.0)~0.1) DisjunctionMaxQuery((myfield:g^3000.0)~0.1) DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4) DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
> 
> That is, it is tokenizing the original query string (mag + 778 + g + 69), and obviously querying the field for the separate tokens doesn't match anything (at least, that is what I think).
> 
> Could anybody please explain this to me?
> 
> Thanks in advance
> Andrea