Posted to solr-user@lucene.apache.org by Jason Chaffee <jc...@ebates.com> on 2010/03/25 23:52:42 UTC

keyword query tokenizer

I have the following configured for a particular field:

 

      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>

 

 

I am using dismax and querying multiple fields, and I expect the query to
be parsed differently for each field.  For some reason, it is not kept as
a single token for this field's query.  For example, the query "Apple
Store" is being broken into two tokens, "apple" and "store".  I would
expect it to be "apple store".
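To make the expectation concrete, here is a toy Python model of what the configured analyzer chain does to whatever text it receives. It is only an illustration, not Solr code: the function name is made up, and it assumes the keyword tokenizer emits its whole input as one token, which the lowercase filter then lowercases.

```python
def keyword_lowercase_analyze(text):
    """Toy model of KeywordTokenizer + LowerCaseFilter (hypothetical helper)."""
    # The keyword tokenizer treats the entire input as a single token;
    # the lowercase filter then lowercases that token.
    return [text.lower()]


# If the analyzer receives the whole query string, one token comes out.
print(keyword_lowercase_analyze("Apple Store"))

# If something upstream has already split the string on whitespace,
# the analyzer is called once per chunk and never sees the phrase.
print([keyword_lowercase_analyze(chunk)[0] for chunk in "Apple Store".split()])
```

The second call reproduces the symptom described here: the analyzer itself is fine, but it is handed "Apple" and "Store" separately.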

 

Does anyone have ideas of what might be going on here?

 

Thanks,

 

Jason

RE: keyword query tokenizer

Posted by Jason Chaffee <jc...@ebates.com>.

Got it working, there was a typo.

-----Original Message-----
From: Jason Chaffee [mailto:jchaffee@ebates.com] 
Sent: Friday, March 26, 2010 1:05 PM
To: solr-user@lucene.apache.org
Subject: RE: keyword query tokenizer

I tried escaping the whitespace, but to no avail.  It is still being broken into two tokens and the whitespace.  Has anyone else tried this?

RE: keyword query tokenizer

Posted by Jason Chaffee <jc...@ebates.com>.

I tried escaping the whitespace, but to no avail.  It is still being broken into two tokens and the whitespace.  Has anyone else tried this?

-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com] 
Sent: Thursday, March 25, 2010 4:05 PM
To: solr-user@lucene.apache.org
Subject: Re: keyword query tokenizer

Before the analysis phase, QueryParser splits on whitespace. You can alter this behavior by escaping whitespace with a backslash: apple\ store

RE: keyword query tokenizer

Posted by Jason Chaffee <jc...@ebates.com>.

Seems like a shortcoming to me.  I would rather it not parse the query
unless there is some delimiter other than whitespace to break it on,
since whitespace is used in phrases.


Instead of this,

"name:bob content:climate"

Maybe use a comma,

"name:bob,content:climate"

"name:bob foo bar,content:climate control"

Then, I can also pass phrases to those fields and allow the analyzers to
handle the tokenizing.
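
As a rough sketch of the comma-delimited idea above (this is a hypothetical syntax proposed in this message, not something any Solr query parser supports), the splitting could look like this in Python. The function name is made up for illustration.

```python
def split_comma_delimited(query):
    """Split a hypothetical 'field:value,field:value' query into a dict."""
    clauses = {}
    for clause in query.split(","):
        # Everything after the first colon is kept intact, spaces included,
        # so the field's own analyzer could decide how to tokenize it.
        field, _, value = clause.partition(":")
        clauses[field.strip()] = value.strip()
    return clauses


print(split_comma_delimited("name:bob foo bar,content:climate control"))
```

A value that itself contains a comma would need some escaping rule, which this sketch does not attempt.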

Jason

-----Original Message-----
From: Tommy Chheng [mailto:tommy.chheng@gmail.com] 
Sent: Thursday, March 25, 2010 8:25 PM
To: solr-user@lucene.apache.org
Subject: Re: keyword query tokenizer

Multi-field searches are one reason for doing the tokenizing in the
parser.

Imagine if your query was "name:bob content:climate"

The parser can tokenize the query into "name:bob", "content:climate" and
pass each into its own analyzer.

Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com


Re: keyword query tokenizer

Posted by Tommy Chheng <to...@gmail.com>.

Multi-field searches are one reason for doing the tokenizing in the parser.

Imagine if your query was "name:bob content:climate"

The parser can tokenize the query into "name:bob", "content:climate" and 
pass each into its own analyzer.

Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com


On 3/25/10 7:37 PM, Jason Chaffee wrote:
> I am curious as to why the query parser does any tokenizing?  I would 
> think you would want control/configure this with your analyzers?
>
> Does anyone know the answer to this. Is there a performance gain or 
> something?
>
> Thanks,
>
> Jason
>

RE: keyword query tokenizer

Posted by Jason Chaffee <jc...@ebates.com>.

I didn't know the quotes would work.  I thought it had to be escaped and
I wasn't too fond of that because you have to unescape in the analysis
phase.  Using quotes doesn't seem so bad to me.

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Monday, March 29, 2010 11:16 AM
To: solr-user@lucene.apache.org
Subject: RE: keyword query tokenizer

: Ahh, but that is exactly what I don't want the DisjunctionMaxQuery to
: do.  I do not want the max scoring field per "word".  Instead, I want it per
: "phrase" which may be a single word or multiple words.

then you need to quote your entire "q" param. (or escape all the white 
space and meta characters)

: You may think "but i'm using dismax, why does dismax need to worry about
: that?" but the key to remember there is that if dismax didn't split on
: whitespace prior to analysis, it wouldn't be able to build the 
: DisjunctionMaxQuery's that it uses to find the max scoring field per 
: "word" (which is the whole point of the parser).

-Hoss

RE: keyword query tokenizer

Posted by Chris Hostetter <ho...@fucit.org>.

: Ahh, but that is exactly what I don't want the DisjunctionMaxQuery to
: do.  I do not want the max scoring field per "word".  Instead, I want it per
: "phrase" which may be a single word or multiple words.

then you need to quote your entire "q" param. (or escape all the white 
space and meta characters)

: You may think "but i'm using dismax, why does dismax need to worry about
: that?" but the key to remember there is that if dismax didn't split on 
: whitespace prior to analysis, it wouldn't be able to build the 
: DisjunctionMaxQuery's that it uses to find the max scoring field per 
: "word" (which is the whole point of the parser).


-Hoss
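
A minimal Python sketch of the "quote your entire q param" suggestion, for building the request parameters on the client side. The helper name and the field names in "qf" are assumptions for illustration. How embedded quotes and backslashes are treated depends on the parser, so this sketch simply drops embedded double quotes rather than trying to escape them.

```python
from urllib.parse import urlencode


def dismax_phrase_params(user_input, qf):
    """Wrap the whole user input in double quotes so it is sent as one phrase."""
    # Embedded double quotes would end the phrase early, so they are replaced
    # with spaces here (a simplification, not a full escaping scheme).
    phrase = user_input.replace('"', " ").strip()
    return {"defType": "dismax", "q": f'"{phrase}"', "qf": qf}


# "name_exact" and "description" are hypothetical field names.
params = dismax_phrase_params("Apple Store", "name_exact description")
print(urlencode(params))
```

The encoded string can be appended to the select URL of whatever Solr instance is being queried.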

RE: keyword query tokenizer

Posted by Jason Chaffee <jc...@ebates.com>.

Ahh, but that is exactly what I don't want the DisjunctionMaxQuery to
do.  I do not want the max scoring field per "word".  Instead, I want it per
"phrase" which may be a single word or multiple words.

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Friday, March 26, 2010 10:35 PM
To: solr-user@lucene.apache.org
Subject: Re: keyword query tokenizer

: 
: I am curious as to why the query parser does any tokenizing?  I would think
: you would want control/configure this with your analyzers?
: 
: Does anyone know the answer to this. Is there a performance gain or something?

it's not about performance, it's about the query parser syntax.

whitespace is "markup" as far as the query parser is concerned -- just 
like +, -, etc. whitespace characters are instructions for the query 
parser.  

Essentially: unquoted whitespace is the markup that tells the query parser 
to create an "OR" query out of the "chunks" of input on either side of the 
space (+ signifies MUST, - signifies PROHIBITED, but there is no markup to 
signify "SHOULD")

Also: if the query parser didn't chunk on whitespace queries like this...

	aWord aField:anotherWord

...wouldn't work in the standard query parser.  

You may think "but i'm using dismax, why does dismax need to worry about 
that?" but the key to remember there is that if dismax didn't split on 
whitespace prior to analysis, it wouldn't be able to build the 
DisjunctionMaxQuery's that it uses to find the max scoring field per 
"word" (which is the whole point of the parser).

-Hoss

Re: keyword query tokenizer

Posted by Chris Hostetter <ho...@fucit.org>.

: 
: I am curious as to why the query parser does any tokenizing?  I would think
: you would want control/configure this with your analyzers?
: 
: Does anyone know the answer to this. Is there a performance gain or something?

it's not about performance, it's about the query parser syntax.

whitespace is "markup" as far as the query parser is concerned -- just 
like +, -, etc. whitespace characters are instructions for the query 
parser.  

Essentially: unquoted whitespace is the markup that tells the query parser 
to create an "OR" query out of the "chunks" of input on either side of the 
space (+ signifies MUST, - signifies PROHIBITED, but there is no markup to 
signify "SHOULD")

Also: if the query parser didn't chunk on whitespace queries like this...

	aWord aField:anotherWord

...wouldn't work in the standard query parser.  

You may think "but i'm using dismax, why does dismax need to worry about 
that?" but the key to remember there is that if dismax didn't split on 
whitespace prior to analysis, it wouldn't be able to build the 
DisjunctionMaxQuery's that it uses to find the max scoring field per 
"word" (which is the whole point of the parser).



-Hoss
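
The chunking described above can be illustrated with a toy splitter in Python. This is not Lucene's or Solr's parser, just a sketch under the assumption that whitespace separates chunks unless it is inside double quotes or preceded by a backslash. The function name is made up.

```python
def chunk_query(q):
    """Split on whitespace that is neither quoted nor backslash-escaped (toy model)."""
    chunks, current, in_quotes = [], [], False
    i = 0
    while i < len(q):
        c = q[i]
        if c == "\\" and i + 1 < len(q):
            # Keep the escaped character literally and skip past it.
            current.append(q[i + 1])
            i += 2
            continue
        if c == '"':
            # Quotes toggle phrase mode and are not part of the chunk.
            in_quotes = not in_quotes
        elif c.isspace() and not in_quotes:
            # Unquoted whitespace ends the current chunk.
            if current:
                chunks.append("".join(current))
                current = []
        else:
            current.append(c)
        i += 1
    if current:
        chunks.append("".join(current))
    return chunks


print(chunk_query("Apple Store"))
print(chunk_query('"Apple Store"'))
print(chunk_query("Apple\\ Store"))
print(chunk_query("aWord aField:anotherWord"))
```

Only the chunks that come out of this step are handed to the per-field analyzers, which is why a keyword tokenizer never sees an unquoted two-word input as one string.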

Re: keyword query tokenizer

Posted by Jason Chaffee <jc...@ebates.com>.

I am curious as to why the query parser does any tokenizing?  I would  
think you would want control/configure this with your analyzers?

Does anyone know the answer to this. Is there a performance gain or  
something?

Thanks,

Jason

On Mar 25, 2010, at 4:04 PM, "Ahmet Arslan" <io...@yahoo.com> wrote:

> Before analysis phase, QueryParser splits on whitespace. You can  
> alter this behavior by escaping whitespace with back slash. apple\  
> store
>
>
>

Re: keyword query tokenizer

Posted by Jason Chaffee <jc...@ebates.com>.

Thanks, didn't realize that.

On Mar 25, 2010, at 4:04 PM, "Ahmet Arslan" <io...@yahoo.com> wrote:

> Before analysis phase, QueryParser splits on whitespace. You can  
> alter this behavior by escaping whitespace with back slash. apple\  
> store
>
>
>

Re: keyword query tokenizer

Posted by Ahmet Arslan <io...@yahoo.com>.

> I have the following configured for a particular field:
>
>       <analyzer type="query">
>         <tokenizer class="solr.KeywordTokenizerFactory" />
>         <filter class="solr.LowerCaseFilterFactory" />
>       </analyzer>
>
> I am using dismax and querying multiple fields and I expect the query to
> be parsed different for each field.  For some reason, it is not kept as
> single token for this field's query.  For example, the query "Apple
> Store"  is being broken into two tokens, "apple" and "store".  I would
> expect it to be "apple store".
>
> Does anyone have ideas of what might be going on here?

Before the analysis phase, QueryParser splits on whitespace. You can alter this behavior by escaping whitespace with a backslash: apple\ store
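
For anyone building the query string in client code, a small Python sketch of the whitespace escaping mentioned above. The helper name is hypothetical, and whether the parser in use honors the escape is worth verifying with debugQuery output.

```python
import re


def escape_whitespace(term):
    """Put a backslash in front of every whitespace character in the term."""
    # In the replacement, "\\\\" is a literal backslash and "\\1" is the
    # matched whitespace character (written here as a raw string).
    return re.sub(r"(\s)", r"\\\1", term)


print(escape_whitespace("apple store"))
```

The sketch does not check for whitespace that is already escaped, so it should be applied to raw user input only once.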