Posted to dev@lucene.apache.org by Gwyn Carwardine <gw...@carwardine.net> on 2006/01/21 14:10:56 UTC

Handling of colons in QueryParserTokenManager

Hello, I'm new here. I've started using dotLucene and I think I need to make
a change to the QueryParser, but it's so complicated to understand what it's
doing that I thought I'd ask if one of you could point me in the right
direction.

In my implementation of Lucene I have the need to store keywords that are of
the form "<key>:<identity>" for example CI:123. Whilst I can store this in
Lucene using Field.Keyword("ID","CI:123") I can't easily look it up by using
QueryParser which I need to do.

Whenever I parse the query ID:CI:123 it parses it as "ID:ci". I've already
made a small hack so that non-tokenized values are indexed as lowercase, so at
least I can get them back if I use ID:CI\:123, but colons are commonly used
and I really don't want to have to escape them everywhere.

What I want to achieve is that the query parser will parse ID:CI:123 as
field(ID) value(CI:123). I understand that the colon is a special character,
but it's only used to delimit fields and values, so it makes sense to react
only to the first colon. The second colon should be treated as part of the
text, which the analyzer could strip out or keep (in my case because I'm
using a custom analyzer).

Does this make sense? How do I go about changing the QueryParserTokenManager
to achieve this? Perhaps you can point me to some documentation that
describes the code even?

Any help gratefully received!

Thanks,
Gwyn Carwardine


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Handling of colons in QueryParserTokenManager

Posted by Daniel Naber <lu...@danielnaber.de>.

On Saturday 21 January 2006 19:46, Chris Hostetter wrote:

> if you are flexible in the syntax you are willing to support, you can
> tell your users that they need to escape the colons that aren't meant as
> field identifiers...
>
>         ID:CI\:123

Or you could use a regular expression to turn ID:CI:123 into ID:CI\:123 
before the QueryParser is used. Probably simpler than messing with 
QueryParser.
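
A minimal sketch of that preprocessing step, in plain Java with no Lucene
dependency. The names ColonEscaper and escapeExtraColons are made up for
illustration, and the sketch assumes terms are separated by single spaces and
ignores quoted phrases and range queries, so treat it as a starting point
rather than a drop-in fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: escapes every colon after the
// first one in each space-separated term, e.g. ID:CI:123 -> ID:CI\:123.
class ColonEscaper {
    // A colon that is not already preceded by a backslash.
    private static final Pattern UNESCAPED_COLON = Pattern.compile("(?<!\\\\):");

    static String escapeExtraColons(String query) {
        // The -1 limit keeps empty pieces, so the original spacing survives.
        String[] terms = query.split(" ", -1);
        for (int i = 0; i < terms.length; i++) {
            terms[i] = escapeTerm(terms[i]);
        }
        return String.join(" ", terms);
    }

    private static String escapeTerm(String term) {
        Matcher m = UNESCAPED_COLON.matcher(term);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        boolean first = true;
        while (m.find()) {
            sb.append(term, last, m.start());
            // The first colon is the field separator; later ones get escaped.
            sb.append(first ? ":" : "\\:");
            first = false;
            last = m.end();
        }
        sb.append(term, last, term.length());
        return sb.toString();
    }
}
```

The result of ColonEscaper.escapeExtraColons(userInput) would then be handed
to the QueryParser in place of the raw input. Colons that are already escaped
are left alone, so running it twice does not double-escape.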

Regards
 Daniel

-- 
http://www.danielnaber.de


Re: Handling of colons in QueryParserTokenManager

Posted by Yonik Seeley <ys...@gmail.com>.

I just verified the behavior of an embedded ':' and I agree it's a
problem that needs to be fixed because it currently silently
truncates.

foo:bar:baz is parsed as foo:bar
foo:bar:baz:what is parsed as foo:bar

The parser should either
 - throw an exception
 - treat ':' (and everything after) as part of the field value
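
To make the two options concrete, here is a small standalone sketch. It is not
the Lucene grammar; FieldValueSplitter, splitStrict and splitLenient are
hypothetical names, and the input is assumed to be a single field:value clause
with no other query syntax in it.

```java
// Hypothetical illustration only; it does not touch the real grammar.
class FieldValueSplitter {
    // Second option from the list: split on the first colon and keep
    // everything after it, further colons included, as the value.
    static String[] splitLenient(String clause) {
        int colon = clause.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("No field separator in: " + clause);
        }
        return new String[] { clause.substring(0, colon), clause.substring(colon + 1) };
    }

    // First option from the list: fail loudly instead of silently truncating
    // when the value contains another colon.
    static String[] splitStrict(String clause) {
        String[] parts = splitLenient(clause);
        if (parts[1].indexOf(':') >= 0) {
            throw new IllegalArgumentException("Unescaped ':' in value: " + parts[1]);
        }
        return parts;
    }
}
```

With the lenient variant, foo:bar:baz:what yields the field foo and the value
bar:baz:what; with the strict variant the same input raises an exception.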

As Erik pointed out, this would have to be fixed in the grammar: QueryParser.jj

-Yonik



Re: Handling of colons in QueryParserTokenManager

Posted by Yonik Seeley <ys...@gmail.com>.

On 1/23/06, Gwyn Carwardine <gw...@carwardine.net> wrote:
> the Token Manager's job is to parse into field & value, it shouldn't make
> any decisions about the value; that value should get passed intact (complete
> with colons and any other special characters)

It's more a matter of parsing than philosophy... the parser must make
decisions about what is part of the field value so it can know where
it is in the grammar.

Examples where the field value is just "bar":
foo:bar^2
foo:bar~2
foo:bar baz

Now your particular case of ':' may be solvable, but the problem in
general is not.  One must escape special characters to avoid
ambiguity.
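
The point can be sketched with a toy scanner: to know that the value in
foo:bar^2 is just "bar", something has to decide which characters end a term.
The code below is hypothetical (it is not how the JavaCC-generated token
manager is written) and assumes that only ^, ~, whitespace and an unescaped
colon end a term, with a backslash escaping the next character.

```java
// Toy scanner, assumed for illustration; the real tokenizer is generated by JavaCC.
class TermScanner {
    // Returns the term text starting at 'start' with escapes resolved,
    // stopping at the first unescaped terminator character.
    static String scanTerm(String input, int start) {
        StringBuilder term = new StringBuilder();
        int i = start;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                // A backslash makes the next character part of the term.
                term.append(input.charAt(i + 1));
                i += 2;
                continue;
            }
            if (c == '^' || c == '~' || c == ':' || Character.isWhitespace(c)) {
                break; // special character: the term ends here
            }
            term.append(c);
            i++;
        }
        return term.toString();
    }
}
```

Scanning from index 4 in each of the three examples above gives "bar", and
scanning from index 3 in ID:CI\:123 gives "CI:123" because the escaped colon
no longer ends the term.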

-Yonik


Re: Handling of colons in QueryParserTokenManager

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 23, 2006, at 6:24 AM, Gwyn Carwardine wrote:
> It definitely was producing the error. I was very careful to test before I
> posted. But now, as you say, it doesn't do it.
>
> However, I wonder if I was entering ["Fred" TO "joe"] (note the capital F)
> because that IS still coming back with HTTP 500 error every time.
>
> http://www.lucenebook.com/search?query=%5B%22Fred%22+TO+%22joe%22%5D

Sure enough. Wow - you win the prize for finding a bug. I believe, but am
not sure yet, that this is due to a TooManyClauses error.

> Format of a query part in Lucene is field:value
>
> However if the value itself contains a colon then the
> QueryParserTokenManager seems to truncate the value at that point. I would
> like to change this behaviour, in fact I think it's behaving illogically..
> the Token Manager's job is to parse into field & value, it shouldn't make
> any decisions about the value; that value should get passed intact (complete
> with colons and any other special characters) through to the Analyzer whose
> job it is to.. well.. analyse!

QueryParser certainly has issues when it comes to special characters,
escaping, and analysis. It's a tricky balancing act, and it certainly does
not apply in non-standard circumstances. Most commonly, special characters
aren't relevant to a field's value, as they are discarded during analysis.
You've got a special case and need to deal with it uniquely, with QueryParser
not being suitable. Whether it makes sense to adjust QueryParser or not, I'm
not sure, but it looks like colon handling needs an improvement.

Again, changes to QueryParser occur in QueryParser.jj, not any of  
the .java files that are generated.

	Erik


RE: Handling of colons in QueryParserTokenManager

Posted by Gwyn Carwardine <gw...@carwardine.net>.

Thanks for your reply Erik (good book by the way)

It definitely was producing the error. I was very careful to test before I
posted. But now, as you say, it doesn't do it.

However, I wonder if I was entering ["Fred" TO "joe"] (note the capital F)
because that IS still coming back with HTTP 500 error every time.

http://www.lucenebook.com/search?query=%5B%22Fred%22+TO+%22joe%22%5D

I thought I did mention colons!

Format of a query part in Lucene is field:value

However if the value itself contains a colon then the
QueryParserTokenManager seems to truncate the value at that point. I would
like to change this behaviour, in fact I think it's behaving illogically..
the Token Manager's job is to parse into field & value, it shouldn't make
any decisions about the value; that value should get passed intact (complete
with colons and any other special characters) through to the Analyzer whose
job it is to.. well.. analyse!

Gwyn



Re: Handling of colons in QueryParserTokenManager

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 21, 2006, at 2:16 PM, Gwyn Carwardine wrote:
> Of course I think someone needs to go into the internals anyway... on 1.4.3
> I get an index out of array bounds error (not a nice parse exception) when
> it tries to parse the following (which it should be able to do):
>
> ["fred" TO "joe"]
>
> Maybe this is fixed in 1.9 but I tried it on the www.lucenebook.com search
> assuming that was using a recent version and that generates a server error!

It does not generate a server error on lucenebook.com:

	<http://www.lucenebook.com/search?query=%5B%22fred%22+TO+%22joe%22%5D>

Maybe you happened to hit the server at some point when there was an  
issue with the server itself (?), but I just tried it and get plenty  
of results.

> It's a real shame that the QueryParserTokenManager had no comments put in to
> explain what on earth it's doing!

Look at QueryParser.jj - that is where the rest is generated from,  
using JavaCC.

Your subject mentions colons, but your example doesn't.  Besides the  
range query example, is there an issue with colons that you want to  
ask about?

	Erik


RE: Handling of colons in QueryParserTokenManager

Posted by Gwyn Carwardine <gw...@carwardine.net>.

I don't want the users to have to use escape characters. I'd rather they
didn't have to use quotes.

Of course I think someone needs to go into the internals anyway... on 1.4.3
I get an index out of array bounds error (not a nice parse exception) when
it tries to parse the following (which it should be able to do):

["fred" TO "joe"]

Maybe this is fixed in 1.9 but I tried it on the www.lucenebook.com search
assuming that was using a recent version and that generates a server error!

It's a real shame that the QueryParserTokenManager had no comments put in to
explain what on earth it's doing!






Re: Handling of colons in QueryParserTokenManager

Posted by Chris Hostetter <ho...@fucit.org>.

if you are flexible in the syntax you are willing to support, you can tell
your users that they need to escape the colons that aren't meant as field
identifiers...

	ID:CI\:123

...alternately, you can tell them they have to quote colons...

	ID:"CI:123"

...then you can avoid the whole painful mess of the parser internals.
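
Where the query string is assembled by application code rather than typed by
hand, the quoted form can be produced programmatically. A minimal sketch in
plain Java, assuming a hypothetical helper named quotedClause and assuming
that backslash and double quote are the only characters that need escaping
inside a quoted phrase:

```java
// Hypothetical helper, not a Lucene API: builds field:"value" so that colons
// inside the value are not read as field separators.
class QueryQuoting {
    static String quotedClause(String field, String value) {
        // Escape backslashes first, then double quotes, so the phrase stays well-formed.
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }
}
```

QueryQuoting.quotedClause("ID", "CI:123") gives ID:"CI:123". The quoted text
still goes through the analyzer, so whether it matches a non-tokenized field
depends on the analyzer in use.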





-Hoss



Re: Handling of colons in QueryParserTokenManager

Posted by Paul Elschot <pa...@xs4all.nl>.

On Monday 23 January 2006 13:10, Gwyn Carwardine wrote:
...
> 
> And now I've been pointed to QueryParser.jj I wonder what language that is?
> And is QueryParser.java created from it? If so how does it get from one to
> the other?! Help!
> 

QueryParser.java is generated from QueryParser.jj by javacc:

https://javacc.dev.java.net/

The easiest way to get from one to the other is to install javacc and
use the various javacc ant targets in lucene's build.xml.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Handling of colons in QueryParserTokenManager

Posted by Gwyn Carwardine <gw...@carwardine.net>.

Hi Otis, sorry if I posted to the wrong group. I thought user was for
usage-type queries and dev was for development-type queries. As I was asking
about changing the code itself (rather than about interfacing with it) I
assumed this was a dev forum issue. I'm still a bit confused.. can you tell
me what the distinction is? I have another issue (where I've hacked the
Lucene code and I want to discuss whether it's a valid hack or not and
possibly how to do it properly) which I would like to raise and I'm confused
as to where to raise it!

Anyway, back to the matter in hand:

At the moment if I use abc:def:123 I would expect my custom analyzer to
receive field(abc) value(def:123) but it's receiving field(abc) value(def).
Somewhere along the line the query parser is throwing away the 123, which I
think is the wrong behaviour. But anyway, what I don't know is where to make a
change to (for all fields) pass the second colon through to the analyzer,
where the analyzer can make the decision about what to do with it, as it
would if I entered abc:def.123 or abc:def;123.

I can override the analyzer but I can't override the query parser behaviour
very easily. I don't understand where the post-colon text is being
discarded. In QueryParser or in QueryParserTokenManager?

And now I've been pointed to QueryParser.jj I wonder what language that is?
And is QueryParser.java created from it? If so how does it get from one to
the other?! Help!

Cheers, Gwyn






Re: Handling of colons in QueryParserTokenManager

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Gwyn - this is a question for java-user@ list, I'm answering there.

Perhaps you can write your own QueryParser.jj variant and change it so it has explicit knowledge of your index fields.  Then, if it parses "foo:...." and "foo" is not one of the fields it knows about, it could escape the colon character on behalf of the user.
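
A rough sketch of this idea as a pre-processing step in plain Java, rather
than as a change to QueryParser.jj. KnownFieldEscaper and its set of field
names are hypothetical, terms are assumed to be separated by single spaces,
and quoted phrases and ranges are not handled. As an extra assumption, it also
escapes any further colons that follow a known field name.

```java
import java.util.Set;

// Hypothetical pre-processor: only a colon that follows a known field name is
// treated as a field separator; every other unescaped colon is escaped.
class KnownFieldEscaper {
    private final Set<String> knownFields;

    KnownFieldEscaper(Set<String> knownFields) {
        this.knownFields = knownFields;
    }

    String escape(String query) {
        // The -1 limit keeps empty pieces, so the original spacing survives.
        String[] terms = query.split(" ", -1);
        for (int i = 0; i < terms.length; i++) {
            terms[i] = escapeTerm(terms[i]);
        }
        return String.join(" ", terms);
    }

    private String escapeTerm(String term) {
        int sep = term.indexOf(':');
        boolean hasField = sep > 0 && knownFields.contains(term.substring(0, sep));
        // Copy "field:" unchanged when the prefix is a known field.
        int valueStart = hasField ? sep + 1 : 0;
        StringBuilder sb = new StringBuilder(term.substring(0, valueStart));
        for (int i = valueStart; i < term.length(); i++) {
            char c = term.charAt(i);
            boolean alreadyEscaped = i > 0 && term.charAt(i - 1) == '\\';
            if (c == ':' && !alreadyEscaped) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With ID as the only known field, ID:CI:123 becomes ID:CI\:123, while foo:bar
becomes foo\:bar because foo is not a known field.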

If you do this, and it works out, please share a patch.  This issue was just raised on one of the projects where I'm using Lucene, and this was going to be my first way of dealing with the problem.

Otis


