
Posted to java-user@lucene.apache.org by Victor Hadianto <vi...@nuix.com.au> on 2003/07/10 06:06:34 UTC

Re: '-' character not interpreted correctly in field names (solution)

Eric and others,

I finally found a solution to this problem, although it is really specific to
our needs.

The simplest solution in the end is to redefine what a "Term" is. At the
moment, QueryParser will parse the following:

t-shirt as

+t -shirt

Which, in my opinion, is not really acceptable. A more sensible parser would
parse "t-shirt" as "t-shirt". If a user wants to query for "t" without
the word "shirt" in it, then the query should really be:

t -shirt
 ^ space here.

Similarly, a field query such as:

model:t-shirt

should really be interpreted as "model:t-shirt", not +model:t -shirt. I think it
really makes more sense to require a space before the "-" to identify a NOT
query.

Onward to the code change. As I said earlier, it is specific to our
application and thus may not be relevant to most other people. Some of
our field names have the "-" sign in them. Changing the TERM_CHAR
definition to:

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >

makes QueryParser compatible with our needs.
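
A minimal sketch of the rule described above, written in plain Java rather than JavaCC. It is not Lucene code: the class DashTermSketch and its methods are hypothetical names, the special-character list is copied from the _TERM_START_CHAR production quoted later in this thread, and escapes, quoted phrases and wildcards are deliberately left out. It only illustrates that a "-" inside a term stays part of the term, while a "-" at the start of a clause still acts as the prohibit modifier.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene or QueryParser.jj. */
final class DashTermSketch {
    // Characters that may not start a term (from _TERM_START_CHAR in the thread).
    private static final String SPECIAL = " \t+-!():^[]\"{}~*?";

    static boolean isTermStart(char c) {
        return SPECIAL.indexOf(c) < 0;
    }

    // The proposed change: "-" is allowed after the first character of a term.
    static boolean isTermChar(char c) {
        return isTermStart(c) || c == '-';
    }

    /** Returns the index just past the term starting at i, or i if none starts there. */
    static int readTerm(String q, int i) {
        if (i < q.length() && isTermStart(q.charAt(i))) {
            i++;
            while (i < q.length() && isTermChar(q.charAt(i))) {
                i++;
            }
        }
        return i;
    }

    /** Splits a query into clauses of the form [+|-] term [":" term]. */
    static List<String> clauses(String query) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int n = query.length();
        while (i < n) {
            char c = query.charAt(i);
            if (c == ' ' || c == '\t') {
                i++;
                continue;
            }
            StringBuilder clause = new StringBuilder();
            if (c == '+' || c == '-') {   // leading modifier
                clause.append(c);
                i++;
            }
            int start = i;
            i = readTerm(query, i);
            if (i < n && query.charAt(i) == ':') {   // field:term
                i = readTerm(query, i + 1);
            }
            if (i == start) {   // character this sketch does not model; skip it
                i++;
                continue;
            }
            clause.append(query, start, i);
            out.add(clause.toString());
        }
        return out;
    }
}
```

Under these assumptions, clauses("t-shirt") gives [t-shirt], clauses("t -shirt") gives [t, -shirt], and clauses("model:t-shirt") gives [model:t-shirt].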


Cheers,

Victor


On Thu, 10 Jul 2003 10:11 am, Victor Hadianto wrote:
> Yep tried that. Actually there is more to the creation of the field than
> just in this line:
>
> fieldToken=<TERM> <COLON> { field = fieldToken.image; }
>
>
> Because I've created a <FIELDNAME> which is exactly the same as <TERM> and
> looks like this:
> | <FIELDNAME: <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
>
> and change fieldToken to:
>
> fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }
>
> And it doesn't work. A simple query such as to:tom* is parsed as a blank query.
>
> I will continue looking at this problem and will post my solution if I get
> it. In the meantime I really do appreciate any help and suggestions.
>
> cheers,
>
> victor
>
> On Thu, 10 Jul 2003 03:24 am, Eric Isakson wrote:
> > You left out the ~ character in your _FIELDNAME_START_CHAR production.
> > That character tells the grammar that it should take all the characters
> > except the ones you specified (the complement).
> >
> > Change:
> > | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",
> >
> > To:
> > | <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",
> >
> > and it should probably work.
> >
> > Eric
> >
> > -----Original Message-----
> > From: Victor Hadianto [mailto:victorh@nuix.com.au]
> > Sent: Wednesday, July 09, 2003 4:53 AM
> > To: Lucene Users List
> > Subject: Re: '-' character not interpreted correctly in field names
> >
> >
> > Hi Eric and others,
> >
> > I'm looking for a similar solution where I need QueryParser not to drop
> > the "-" characters from the field name. However, outside the field I do want
> > the - sign interpreted as the "not" modifier.
> >
> > I'm definitely not an expert in JavaCC and to be honest I only have a
> > limited idea of how Eric's suggestion works.
> >
> > Anyway I followed the suggestion and added the following:
> > | <#_WHITESPACE: ( " " | "\t" ) >
> > | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
> >                                "[", "]", "\"", "{", "}", "~", "*", "?" ]
> >                              | <_ESCAPED_CHAR> ) >
> > | <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >
> >
> > and again below I added:
> > | <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
> > | <FIELDNAME: <_FIELDNAME_START_CHAR> (<_FIELDNAME_CHAR>)*  >
> >
> > And I changed:
> >
> >     LOOKAHEAD(2)
> >     fieldToken=<TERM> <COLON> { field = fieldToken.image; }
> >
> > to: ...
> >
> >     LOOKAHEAD(2)
> >     fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }
> >
> >
> > Well, after making all these mods, every query that involves field names
> > causes problems. For example, if I search for
> >
> > fieldname:hello
> >
> > The query is blank (yes blank, nothing in it)
> >
> > and if the fieldname does contain a dash ("-"), for example:
> > field-name:hello
> >
> > The query is: +field -name
> >
> > hello is dropped.
> >
> >
> > Does anyone have any idea? Help and suggestions will be much appreciated.
> > I really need to get this dash working, changing the field name will be
> > my last resort which I won't explore until I really have to.
> >
> >
> > Thanks,
> >
> > Victor
> >
> > On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
> > > I think the query parser changes would not be too bad. I've outlined a
> > > couple of relevant lines you should look at so you don't have to try
> > > and comprehend the productions for the entire QueryParser. I do not
> > > think I would like to have to maintain one of those myself though.
> > > Your other unmentioned alternative is to choose field names that match
> > > the <TERM> production of QueryParser.jj without escapes.
> > >
> > > QueryParser.jj line 557:
> > >     fieldToken=<TERM> <COLON> { field = fieldToken.image; }
> > >
> > > and earlier...
> > >  <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
> > >                           "[", "]", "\"", "{", "}", "~", "*", "?" ] >
> > >
> > > | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
> > >                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
> > >                        | <_ESCAPED_CHAR> ) >
> > > | <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
> > >
> > > ...
> > >
> > > <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
> > >
> > > So the characters you need to avoid in your field names are the ones
> > > from _ESCAPED_CHAR, [ "\\", "+", "-", "!", "(", ")", ":", "^", "[",
> > > "]", "\"", "{", "}", "~", "*", "?" ]
> > >
> > > If you need to modify the parser, you will probably want to add a
> > > FIELDNAME token and other supporting productions that look really
> > > similar to these lines I've copied but modify the complement, ~[...],
> > > at the beginning of _FIELDNAME_START_CHAR (you would add this
> > > production) so it will match the "-" that you are using in your field
> > > names (and fix it to match any other characters you want to use in
> > > field names that it doesn't allow right now).
> > >
> > > Eric
> > >
> > > -----Original Message-----
> > > From: Jon Pipitone [mailto:jpipitone@mshri.on.ca]
> > > Sent: Wednesday, May 14, 2003 2:26 PM
> > > To: Lucene Users List
> > > Subject: Re: '-' character not interpreted correctly in field names
> > >
> > > Eric Isakson wrote:
> > > > I just looked at the QueryParser.jj code, your field names
> > > > never get processed by the analyzer. It does look like the query
> > > > parser will honor escapes though. I haven't tried this, but try a
> > > > query like "foo\-bar:foo" and have a look at the QueryParser.jj
> > > > file for how it handles field names when parsing your query.
> > >
> > > Hrm.. that's what I had found too.  So, you're saying that, other than
> > > escaping dashes, I'd have to change QueryParser.. ?
> > >
> > > I'm not too familiar just yet with JavaCC syntax, so reading through
> > > QueryParser is a little tough going.  Thanks Eric,
> > >
> > > jp
> > >
> > > > -----Original Message-----
> > > > From: Jon Pipitone [mailto:jpipitone@mshri.on.ca]
> > > > Sent: Monday, May 12, 2003 4:03 PM
> > > > To: Lucene Users List
> > > > Subject: Re: '-' character not interpreted correctly in field names
> > > >
> > > >
> > > > Hi Otis, Terry,
> > > >
> > > > >>>You can write a custom Analyzer that does not remove dashes from
> > > > >>>tokens, and use it for both indexing and searching.
> > > > >>>
> > > > >>>This is a frequent question and answer on this list.
> > > >
> > > > Sorry for the noise, but I haven't been able to find a solution in
> > > > the mailing list archives, or by writing my own analyzer:
> > > >
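> > > > 	import java.io.Reader;
> > > > 	import org.apache.lucene.analysis.Analyzer;
> > > > 	import org.apache.lucene.analysis.CharTokenizer;
> > > > 	import org.apache.lucene.analysis.TokenStream;
> > > >
> > > > 	// Keeps letters and dashes together in a single token.
> > > > 	public class MyAnalyzer extends Analyzer {
> > > > 		public TokenStream tokenStream(String fieldName, Reader reader) {
> > > > 			return new CharTokenizer(reader) {
> > > > 				protected boolean isTokenChar(char c) {
> > > > 					return Character.isLetter(c) || c == '-';
> > > > 				}
> > > > 			};
> > > > 		}
> > > > 	}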
> > > >
> > > > I parse a query like this:
> > > >
> > > > 	String queryString = "foo-bar:foo";
> > > > 	Query queryResult =
> > > > 		QueryParser.parse(queryString, "body", new MyAnalyzer());
> > > >
> > > > With the output:
> > > > 	body:foo -bar:foo
> > > >
> > > > But I would expect the output:
> > > > 	 foo-bar:foo
> > > >
> > > > If I print out the tokens that MyAnalyzer produces I do get
> > > > "foo-bar" and then "foo".
> > > >
> > > > Any pointers on what I'm doing wrong?
> > > >
> > > > jp
> > > >
> > > >>>>--- Jon Pipitone <jp...@mshri.on.ca> wrote:
> > > >>>>>Hi all,
> > > >>>>>
> > > >>>>>>I believe that the tokenizer treats a dash as a token separator.
> > > >>>>>>Hence, the only way, as I recall, to eliminate this behavior is
> > > >>>>>>to modify QueryParser.jj so it doesn't do this.  However, doing
> > > >>>>>>this can cause some other problems, like hyphenated words at a
> > > >>>>>>line break and the like.
> > > >>>>>
> > > >>>>>I've recently started using lucene and I'm running into the same
> > > >>>>>issue with the query parser.  I'd like to use queries that
> > > >>>>>contain dashes in the field name, but as far as I can tell it
> > > >>>>>seems that the current query grammar treats field names as terms,
> > > >>>>>and so, as Terry notes, a dash becomes a token separator.
> > > >>>>>
> > > >>>>>Terry suggests modifying the QueryParser.jj -- I would suspect by
> > > >>>>>creating a separate non-terminal for field names.
> > > >>>>>
> > > >>>>>Has anyone done any work on this already?  Is modifying
> > > >>>>>QueryParser.jj the best approach?
> > > >>>>>
> > > >>>>>Thanks,
> > > >>>>>jp
> >
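
Eric's point about the missing ~ in the quoted thread above can be illustrated outside JavaCC. The sketch below is an analogy only: it uses java.util.regex, where ^ at the start of a character class plays the role of JavaCC's ~ complement. The class name ComplementSketch and its members are hypothetical; the character list is the one from the quoted _FIELDNAME_START_CHAR production. Without the complement, the class matches only the special characters themselves, which would be consistent with fieldname:hello coming back blank.

```java
import java.util.regex.Pattern;

/** Analogy only: java.util.regex character classes, not JavaCC. */
final class ComplementSketch {
    // The characters listed in the thread's _FIELDNAME_START_CHAR production,
    // written so they are safe inside a regex character class.
    private static final String LISTED = " \\t+\\-!():^\\[\\]\"{}~*?";

    // Without the complement: matches runs made only of the listed characters.
    static final Pattern WITHOUT_COMPLEMENT = Pattern.compile("[" + LISTED + "]+");

    // With the complement: matches runs of anything except the listed characters.
    static final Pattern WITH_COMPLEMENT = Pattern.compile("[^" + LISTED + "]+");

    /** True if the whole string is matched by the given pattern. */
    static boolean isName(Pattern p, String s) {
        return p.matcher(s).matches();
    }
}
```

Note that even with the complement, "field-name" is still rejected as long as "-" stays in the list, which matches Eric's earlier advice to adjust the list for any character that field names need.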


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
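
For the escaping route Eric Isakson mentions in the quoted thread above (a query like foo\-bar:foo), a small helper could add the backslashes. This is a sketch under assumptions: FieldNameEscaper is a hypothetical class, not a Lucene API, and the character list is copied from the _ESCAPED_CHAR production quoted in the thread. Eric notes he had not tried the escape himself, so whether the parser of that era accepts an escaped field name is not confirmed here.

```java
/** Hypothetical helper; not part of Lucene. */
final class FieldNameEscaper {
    // Characters listed in _ESCAPED_CHAR in the quoted grammar.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?";

    /** Prefixes every special character in the name with a backslash. */
    static String escape(String name) {
        StringBuilder sb = new StringBuilder(name.length() + 4);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With this helper, escape("foo-bar") + ":foo" produces the query text foo\-bar:foo from Eric's example.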

Re: '-' character not interpreted correctly in field names (solution)

Posted by Jan Agermose <ja...@agermose.dk>.

+1

----- Original Message ----- 
From: "Eric Jain" <Er...@isb-sib.ch>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, July 10, 2003 12:53 PM
Subject: Re: '-' character not interpreted correctly in field names
(solution)


> > I think this is a fine change, that others would welcome, too.
> > No?
> > Does your change work with queries that start with a '-' character?
> > For example: -shirt +pants
> > (note: no space before '-shirt')
> >
> > If so, I think we could include this change in QueryParser.jj if you
> > send the diff, as I recall others wondering why queries like t-shirt
> > get misinterpreted as +t -shirt.
>
> +1
>
> --
> Eric Jain



Re: '-' character not interpreted correctly in field names (solution)

Posted by Eric Jain <Er...@isb-sib.ch>.

> I think this is a fine change, that others would welcome, too.
> No?
> Does your change work with queries that start with a '-' character?
> For example: -shirt +pants
> (note: no space before '-shirt')
> 
> If so, I think we could include this change in QueryParser.jj if you
> send the diff, as I recall others wondering why queries like t-shirt
> get misinterpreted as +t -shirt.

+1

--
Eric Jain



Re: '-' character not interpreted correctly in field names (solution)

Posted by Victor Hadianto <vi...@nuix.com.au>.

Okay, attached is the diff file to allow t-shirt to be interpreted as
"t-shirt". Queries that start with a "-" character behave as expected, well,
at least as we expected.

For example, -shirt +pants is parsed as -shirt +pants.

One thing I need to mention (I dug this up from an earlier discussion on this
list) is that Doug Cutting said this about a similar change someone else
proposed:

<---- cut --->
Lixin Meng wrote:
> Therefore, it would be preferable to treat all hyphens in the same way.
> Either as a delimiter or as part of the word (maybe with a flag at the API).

If we change StandardTokenizer in this way then we risk breaking all the 
applications that currently use it and depend on its current behaviour. 
  So I'm reluctant to make this change.

From the StandardTokenizer documentation:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html

"Many applications have specific tokenizer needs. If this tokenizer does 
not suit your application, please consider copying this source code 
directory to your project and maintaining your own grammar-based tokenizer."

Also, if you construct a tokenizer that you think is more generally 
useful than StandardTokenizer, please contribute it by mailing it to one 
of the Lucene mailing lists.

Thanks,

Doug

<---- cut ---->

So yes, this change _may_ break other existing applications.

cheers,
victor


On Thu, 10 Jul 2003 08:34 pm, Otis Gospodnetic wrote:
> I think this is a fine change, that others would welcome, too.
> No?
> Does your change work with queries that start with a '-' character?
> For example: -shirt +pants
> (note: no space before '-shirt')
>
> If so, I think we could include this change in QueryParser.jj if you
> send the diff, as I recall others wondering why queries like t-shirt
> get misinterpreted as +t -shirt.
>
> Thanks,
> Otis

Re: NOT, exclude, prohibit, !, -

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hello,

I believe those are all the same.  I am writing this from an Internet
cafe in Sofia, Bulgaria, so I don't have access to QueryParser.jj, but
that's where you should look to see how NOT, ! and - are defined.
I believe they all exclude from the results all documents that contain
the term or phrase that follows.
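
The documentation quoted below describes NOT as a set difference. A minimal Java sketch of that idea follows; it uses no Lucene API, ExcludeSketch is a hypothetical name, and the document IDs in the usage note are made up for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of exclusion as set difference; no Lucene API is used. */
final class ExcludeSketch {
    /** Returns the IDs in 'required' that do not appear in 'prohibited'. */
    static Set<Integer> exclude(Set<Integer> required, Set<Integer> prohibited) {
        Set<Integer> result = new TreeSet<>(required);   // copy, inputs stay untouched
        result.removeAll(prohibited);
        return result;
    }
}
```

If documents 1, 2 and 3 contain "jakarta apache" and documents 2 and 4 contain "jakarta lucene", then excluding the second set from the first leaves documents 1 and 3, whichever of NOT, ! or - was used to express the exclusion.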

Otis


--- Jon Crowell <jc...@dsg.harvard.edu> wrote:
> Hi.  I've been reading the Query Syntax page at
> http://jakarta.apache.org/lucene/docs/queryparsersyntax.html and I'm
> not
> sure I understand the difference between the prohibit operator (-)
> and the
> exclude operator (!).
> 
> It seems that NOT is the exclude operator and the short form is !.  I
> quote:
> 
> 
>      The NOT operator excludes documents that contain the
>      term after NOT. This is equivalent to a difference
>      using sets. The symbol ! can be used in place of the
>      word NOT.
> 
>      To search for documents that contain "jakarta apache"
>      but not "jakarta lucene" use the query: 
> 
>     "jakarta apache" NOT "jakarta lucene"
> 
> 
> The minus sign (-) is described in its own section as the prohibit
> operator.
> I quote:
> 
> 
>      The "-" or prohibit operator excludes documents that
>      contain the term after the "-" symbol.
> 
>      To search for documents that contain "jakarta apache"
>      but not "jakarta lucene" use the query: 
>      
>      "jakarta apache" -"jakarta lucene"
> 
> 
> My question is: what is the difference between these two operators? 
> If
> there is no difference, then why are there two operators?
> 
> Thanks,
> 
> Jon



NOT, exclude, prohibit, !, -

Posted by Jon Crowell <jc...@dsg.harvard.edu>.

Hi.  I've been reading the Query Syntax page at
http://jakarta.apache.org/lucene/docs/queryparsersyntax.html and I'm not
sure I understand the difference between the prohibit operator (-) and the
exclude operator (!).

It seems that NOT is the exclude operator and the short form is !.  I quote:


     The NOT operator excludes documents that contain the
     term after NOT. This is equivalent to a difference
     using sets. The symbol ! can be used in place of the
     word NOT.

     To search for documents that contain "jakarta apache"
     but not "jakarta lucene" use the query: 

    "jakarta apache" NOT "jakarta lucene"


The minus sign (-) is described in its own section as the prohibit operator.
I quote:


     The "-" or prohibit operator excludes documents that
     contain the term after the "-" symbol.

     To search for documents that contain "jakarta apache"
     but not "jakarta lucene" use the query: 
     
     "jakarta apache" -"jakarta lucene"


My question is: what is the difference between these two operators?  If
there is no difference, then why are there two operators?

Thanks,

Jon


   





Re: '-' character not interpreted correctly in field names (solution)

Posted by Otis Gospodnetic <ot...@yahoo.com>.

I think this is a fine change, that others would welcome, too.
No?
Does your change work with queries that start with a '-' character?
For example: -shirt +pants
(note: no space before '-shirt')

If so, I think we could include this change in QueryParser.jj if you
send the diff, as I recall others wondering why queries like t-shirt
get misinterpreted as +t -shirt.

Thanks,
Otis
