You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Victor Hadianto <vi...@nuix.com.au> on 2003/07/09 10:52:42 UTC

Re: '-' character not interpreted correctly in field names

Hi Erik and others,

I'm looking for a similar solution where I need QueryParser not to drop the 
"-" characters from the field name. Hower outside the field I do want the - 
sign interpreted as "not" modifier. 

I'm definitely not an expert in JavaCC and to be honest I only have a limited 
idea about Erik's suggestion work,

Anyway I followed the suggestion and added the following:

| <#_WHITESPACE: ( " " | "\t" ) >
| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
                               "[", "]", "\"", "{", "}", "~", "*", "?" ]
                             | <_ESCAPED_CHAR> ) >
| <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >

and again below I added:


| <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
| <FIELDNAME: <_FIELDNAME_START_CHAR> (<_FIELDNAME_CHAR>)*  >

And I changed:

    LOOKAHEAD(2)
    fieldToken=<TERM> <COLON> { field = fieldToken.image; }

to: ...

    LOOKAHEAD(2)
    fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }


Well after doing all this mods all the query that involved field names cause 
problem, for example if I searched for

fieldname:hello

The query is blank (yes blank, nothing in it)

and if the fieldname does contain a dash ("-") for example: field-name:hello

They query is: +field -name

hello is dropped.


Does anyone has any idea? Help and suggestions will be much appreciated. I 
really need to get this dash working, changing the field name will be my last 
resort which I won't explore until I really have to.


Thanks,

Victor


On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
> I think the query parser changes would not be too bad, I've outlined a
> couple of relavant lines you should look at so you don't have to try and
> comprehend the productions for the entire QueryParser. I do not think I
> would like to have to maintain one of those myself though. Your other
> unmentioned alternative is to choose field names that match the <TERM>
> production of QueryParser.jj without escapes.
>
> QueryParser.jj line 557:
>     fieldToken=<TERM> <COLON> { field = fieldToken.image; }
>
> and earlier...
>  <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
>                           "[", "]", "\"", "{", "}", "~", "*", "?" ] >
>
> | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
>
>                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
>
>                        | <_ESCAPED_CHAR> ) >
> |
> | <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
>
> ...
>
> <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
>
> So the characters you need to avoid in your field names are the ones from
> _ESCAPED_CHAR, [ "\\", "+", "-", "!", "(", ")", ":", "^", "[", "]", "\"",
> "{", "}", "~", "*", "?" ]
>
> If you need to modify the parser, you will probably want to add a FIELDNAME
> token and other supporting productions that look really similar to these
> lines I've copied but modify the complement, ~[...], at the beginning of
> _FIELDNAME_START_CHAR (you would add this production) so it will match the
> "-" that you are using in your field names (and fix it to match any other
> characters you want to use in field names that it doesn't allow right now).
>
> Eric
>
> -----Original Message-----
> From: Jon Pipitone [mailto:jpipitone@mshri.on.ca]
> Sent: Wednesday, May 14, 2003 2:26 PM
> To: Lucene Users List
> Subject: Re: '-' character not interpreted correctly in field names
>
> Eric Isakson wrote:
> > I just looked at the QueryParser.jj code, your field names
> >
>  > never get processed by the analyzer. It does look like the
>  > query parser will honor escapes though. I haven't tried
>  > this, but try a query like "foo\-bar:foo" and have
> >
> > a look at the QueryParser.jj file for how it handles field
> >
>  > names when parsing your query.
>
> Hrm.. that's what I had found too.  So, you're saying that, other than
> escaping dashes, I'd have to change QueryParser.. ?
>
> I'm not too familiar just yet with JavaCC syntax, so reading through
> QueryParser is a little tough going.  Thanks Eric,
>
> jp
>
> > -----Original Message-----
> > From: Jon Pipitone [mailto:jpipitone@mshri.on.ca]
> > Sent: Monday, May 12, 2003 4:03 PM
> > To: Lucene Users List
> > Subject: Re: '-' character not interpreted correctly in field names
> >
> >
> > Hi Otis, Terry,
> >
> >  >>>You can write a custom Analyzer that does not remove dashes from
> > >>>
> > >>>tokens, and use it for both indexing and searching.  >>>  >>>This
> >
> > is a frequent question and answer on this list.
> >
> > Sorry for the noise, but I haven't been able to find a solution in the
> > mailing list archives, or by writing my own analyzer:
> >
> > 	public class MyAnalyzer extends Analyzer {
> > 	public TokenStream tokenStream(String fieldName, Reader reader) 		{
> > 		return new CharTokenizer(reader) {
> > 			protected boolean isTokenChar(char c) {
> > 				return Character.isLetter(c) || c == '-';
> > 			}
> > 		};
> > 	}
> > 	}
> >
> > I parse a query like this:
> >
> > 	String queryString = "foo-bar:foo";
> > 	String queryResult =
> > 		QueryParser.parse(queryString, "body", new MyAnalyzer())
> >
> > With the output:
> > 	body:foo -bar:foo
> >
> > But I would expect the output:
> > 	 foo-bar:foo
> >
> > If I print out the tokens that MyAnalyzer produces I do get "foo-bar"
> > and then "foo".
> >
> > Any pointers on what I'm doing wrong?
> >
> > jp
> >
> >>>>--- Jon Pipitone <jp...@mshri.on.ca> wrote:
> >>>>>Hi all,
> >>>>>
> >>>>>>I believe that the tokenizer treats a dash as a token
> >>>
> >>>separator.
> >>>
> >>>>>>Hence, the only way, as I recall, to eliminate this behavior
> >>>
> >>>is
> >>>
> >>>>>>to modify QueryParser.jj so it doesn't do this.  However,
> >>>
> >>>doing
> >>>
> >>>>>>this can cause some other problems, like hyphenated words at a
> >>>>>>line break and the like.
> >>>>>
> >>>>>I've recently started using lucene and I'm running into the same
> >>>>>issue with the query parser.  I'd like to use queries that contain
> >>>
> >>>dashes
> >>>
> >>>>>in
> >>>>>the field name, but as far as I can tell it seems that the
> >>>
> >>>current
> >>>
> >>>>>query
> >>>>>grammar treats field names as terms, and so, as Terry notes, a
> >>>
> >>>dash
> >>>
> >>>>>becomes a token seperator.
> >>>>>
> >>>>>Terry suggests modifying the QueryParser.jj -- I would suspect by
> >>>>>creating a seperate non-terminal for field names.
> >>>>>
> >>>>>Has anyone done any work on this already?  Is modifying
> >>>>>QueryParser.jj the best approach?
> >>>>>
> >>>>>Thanks,
> >>>>>jp



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org