Posted to java-user@lucene.apache.org by Walt Stoneburner <wa...@gmail.com> on 2007/07/13 17:13:11 UTC

Standard Analyzer Escapes

In reading the documentation for escape characters, I'm having a
little trouble understanding what it wants me to do for certain
special cases.

http://lucene.apache.org/java/docs/queryparsersyntax.html#Escaping%20Special%20Characters
says: "Lucene supports escaping special characters that are part of
the query syntax. The current list special characters are:   + - && ||
! ( ) { } [ ] ^ " ~ * ? : \     To escape these character use the \
before the character."

Specifically, I'm curious about the double characters && and || and
how they should be properly escaped.

Experimentation showed some very strange things with the StandardAnalyzer.

Using Luke, I get some interesting mappings.
  AT&T    becomes  at&t    (as expected)
  AT&&T  becomes  t   (tricky... "at" is now taken as a stop word; fine,
that makes sense)

..but what about...   "AT&&T"   ...nope, still t.

AAA&BBB becomes aaa&bbb    ...correct
AAA&&BBB becomes   aaa bbb   ...ampersand becomes a space?
"AAA&&BBB" is also    aaa bbb

AAA\&BBB correctly is   aaa&bbb   ...just as before
AAA\&&BBB   is  aaa bbb   ...but perhaps we got the escape wrong.

Is '&&' a special "character", and is it escaped as \&& or as
\&\&? Let's find out.

AAA\&\&BBB   is also  aaa bbb   ...perhaps we need quotes?
"AAA\&\&BBB"   is also  aaa bbb   ...I can't seem to get the escape to work.

How about this?
AAA&BBB&CCC    strangely becomes   aaa&bbb ccc

Even when escaped?
AAA\&BBB\&CCC  is also    aaa&bbb ccc    ...appears so.

What about...
AAA&BBB&CCC&DDD   becomes   aaa&bbb ccc&ddd  ....whoa, not expecting that.

AAA&&BBB&&CCC&&DDD  becomes   aaa bbb ccc ddd  ...if && means AND, ok...

AAA\&&BBB\&&CCC\&&DDD   no change  aaa bbb ccc ddd

AAA\&\&BBB\&\&CCC\&\&DDD  also no change  aaa bbb ccc ddd


It appears I literally cannot search for the token with two ampersands
in it, whether they are touching or not.

Clearly I'm missing something.  Is there a way to get any literal
sequence of my choosing, using escapes, as a term in the Lucene
expression?

-Walt Stoneburner

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Standard Analyzer Escapes

Posted by Mark Miller <ma...@gmail.com>.

This is certainly the case. StandardAnalyzer has a regex matcher that
looks for a possible company name involving an & or an @. The
QueryParser is handling the escaped '&'; all of the effects described are
standard results of using the StandardAnalyzer. Any double '&&' will
break the text into separate tokens, but 'sdfdf&dfsdf' will match as a
company name. Escaping does not affect the matches that StandardAnalyzer
tries to make; it only keeps the QueryParser from treating the escaped
character as an operator.
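
A minimal sketch of why the escaped and unescaped forms end up identical
by the time the analyzer sees them. This is not Lucene's code:
EscapeSketch, escape, and unescape are hypothetical names, and treating
'&' and '|' one character at a time is an assumption made for
illustration.

```java
public class EscapeSketch {
    // Single characters taken from the documented special-character list.
    // Assumption: '&' and '|' are escaped individually, not as pairs.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    /** Prefix every special character with a backslash. */
    public static String escape(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Drop each escaping backslash and keep the character it protected. */
    public static String unescape(String escaped) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '\\' && i + 1 < escaped.length()) {
                c = escaped.charAt(++i); // skip the backslash itself
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = escape("AAA&&BBB");
        System.out.println(escaped);                 // AAA\&\&BBB
        System.out.println(unescape(escaped));       // AAA&&BBB
        System.out.println(unescape("AAA\\&&BBB"));  // AAA&&BBB
    }
}
```

Whichever way the ampersands are escaped, the text that remains after
the parser stage is AAA&&BBB, so the analyzer tokenizes all of the
variants the same way.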

'sdfdf&dfsdf&sdfd' will match as the company name sdfdf&dfsdf followed
by the token sdfd: the first '&' produces a company match, and the
second '&' breaks the text. Check out the regex in StandardTokenizer.jj.
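
The company-name behavior can be approximated with a toy tokenizer. The
pattern below is only a rough stand-in for the rule in
StandardTokenizer.jj, not a copy of it; CompanyRuleSketch and tokenize
are hypothetical names, and stop-word removal is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompanyRuleSketch {
    // Approximation: letters, a single '&' or '@', then letters form one
    // "company" token; otherwise a run of letters or digits is a token.
    private static final Pattern TOKEN =
        Pattern.compile("\\p{L}+[&@]\\p{L}+|[\\p{L}\\p{N}]+");

    /** Lowercase the text and collect tokens from left to right. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "AAA&BBB", "AAA&&BBB", "AAA&BBB&CCC", "AAA&BBB&CCC&DDD"
        };
        for (String s : inputs) {
            System.out.println(s + " -> " + tokenize(s));
        }
        // AAA&BBB -> [aaa&bbb]
        // AAA&&BBB -> [aaa, bbb]
        // AAA&BBB&CCC -> [aaa&bbb, ccc]
        // AAA&BBB&CCC&DDD -> [aaa&bbb, ccc&ddd]
    }
}
```

Matching proceeds left to right, so each company match consumes one '&'
together with the words on both sides of it, and the following '&' is
left over as a separator. That reproduces the pairing seen in the
original experiments.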

Also note that there is no 'real' literal search in Lucene.
Anything in quotes still gets passed to the Analyzer, so you will get
similar results whether you use quotes or not.

- Mark

Yonik Seeley wrote:
> I just tried some things fast via the Solr admin interface, and
> everything seems fine.
> I think you are probably confusing what the parser does vs what the
> analyzer does.
> Try your tests with an un-tokenized field to remove that effect.
>
> -Yonik

Re: Standard Analyzer Escapes

Posted by Yonik Seeley <yo...@apache.org>.

I just tried some things fast via the Solr admin interface, and
everything seems fine.
I think you are probably confusing what the parser does vs what the
analyzer does.
Try your tests with an un-tokenized field to remove that effect.
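
A rough sketch of the difference an un-tokenized field makes. It does
not use Lucene at all: FieldModeSketch and both methods are hypothetical
names, and the lowercase-and-split step is only a crude stand-in for an
analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class FieldModeSketch {
    /** Terms kept for an analyzed field: lowercase, split on non-alphanumerics. */
    public static Set<String> tokenizedTerms(String value) {
        Set<String> terms = new HashSet<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Terms kept for an un-tokenized field: the whole value, unchanged. */
    public static Set<String> untokenizedTerms(String value) {
        return new HashSet<>(Arrays.asList(value));
    }

    public static void main(String[] args) {
        String value = "AAA&&BBB";
        // The analyzed field never contains the literal value as a term.
        System.out.println(tokenizedTerms(value).contains(value));   // false
        // The un-tokenized field holds exactly that one term.
        System.out.println(untokenizedTerms(value).contains(value)); // true
    }
}
```

With the analyzer out of the picture, any remaining surprises in a test
can be attributed to the parser and its escaping rules alone.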

-Yonik
