Posted to java-user@lucene.apache.org by Stefan Bergstrand <st...@polopoly.com> on 2002/01/14 08:06:52 UTC

Parsing of queries.

I add a field to a document using:

doc.add(Field.Text("path", "d=100&a=102"));


When I search for the document using "d=100&a=102" as the query using:

    public static void main(String[] args) {
        String indexDir = args[0];
        String queryStr = args[1];

        System.out.println("indexDir = " + indexDir);
        System.out.println("query    = " + queryStr);

        IndexSearcher searcher = new IndexSearcher(indexDir);

        Term term = new Term("path", queryStr);
        TermQuery query = new TermQuery(term);

        Hits hits = searcher.search(query);

        if (hits.length() == 0) {
            System.out.println("length = 0");
        }
    }


it returns nothing. If I use "SearchFiles" (the search example that
comes with the Lucene dist) I get:

Query: d=100&a=102
Exception in thread "main" org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 7.  Encountered: "a" (97), after : "&"
        at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_scan_token(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_3_1(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_2_1(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.Clause(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
        at Lucsearch.main(Lucsearch.java:36)


How do I format the query in order to keep the parser satisfied? I
have tried the usual \-escaping of difficult characters, but that
doesn't work either. Is there a way to set which characters are
allowed in a query or something similar?

/Stefan Bergstrand

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Parsing of queries.

Posted by Brian Goetz <br...@quiotix.com>.
> How do I format the query in order to keep the parser satisfied? I
> have tried the usual \-escaping of difficult characters, but that
> doesn't work either. Is there a way to set which characters are
> allowed in a query or something similar?

Right now, the parser doesn't know what to do with characters like
= and &.


Re: Parsing of queries.; NEAR queries

Posted by Brian Goetz <br...@quiotix.com>.
>Wow, you fixed it just a few days after I mentioned it (or is it an
>old issue?).

I'd like to take credit for being ultra-responsive, but really we've been 
kicking this around for a while.  But you did motivate me to get off my 
butt and fix it instead of just thinking about it.


--
Brian Goetz
Quiotix Corporation
brian@quiotix.com           Tel: 650-843-1300            Fax: 650-324-8032

http://www.quiotix.com


Re: Parsing of queries.; NEAR queries

Posted by Stefan Bergstrand <st...@polopoly.com>.
Brian Goetz <br...@quiotix.com> writes:

> >doc.add(Field.Text("path", "d=100&a=102"));
> 
> I have, finally, fixed the query parser to define query terms by
> exclusion instead of inclusion.  The terms above, as well as the many
> other posted examples, should now work.

Wow, you fixed it just a few days after I mentioned it (or is it an
old issue?). Anyway, after having tested Lucene (and the
lucene-[dev|user] mailing lists) I am prepared to enter the Church
of Lucene community, and exit the swish-e swamp. (Ok, it's not a fair
contest, since I perform indexing and searching from a Java
program. But still.)

/Stefan B

> 
> A side-effect of this is that special tokens, like && for AND, and ||
> for OR, must be separated from the query terms by spaces: if you want
> a && b, you have to say a && b, not a&&b.  I don't think this should
> be a problem.
> 
> Next up: NEAR.  Everyone wants it, but we're looking for a decent
> syntax, and many of the good punctuation characters have already been
> snapped up (like brackets and braces for range queries.)
> 
> We could use
>    a NEAR b
> or
>    a WITHIN N OF b
> but these both have the problem that they don't generalize well to
> phrases with more than two terms.
> 
> Or we could have a (yet another) modifier on the quoted phrase query
> to set the slop --
> 
>    "Mickey Minnie"(5)
> or
>    "Mickey Minnie" SLOP(5)
> 
> Lots of possibilities exist, but so far they're all pretty
> yucky. Suggestions?


Re: Parsing of queries.; NEAR queries

Posted by ca...@bookandhammer.com.
I know; I was really just describing my ideal, hoping that someone would 
say, "we could do that."

--Peter
On Thursday, January 17, 2002, at 12:11 AM, Brian Goetz wrote:

> Except that's not how Phrase queries work.  Phrase queries are composed 
> of Terms, and the slop factor tells how close the terms have to be to 
> each other.  So right now, there's no way to search for phrase near 
> phrase.  However, you can search for three words in close proximity, by 
> creating a Phrase query with a slop of > 1.


Re: Parsing of queries.; NEAR queries

Posted by Brian Goetz <br...@quiotix.com>.
>I guess what I would really like is
>
>"Microsoft Word" NEAR3 "Microsoft Excel"
>
>Where I could combine the phrases together with a NEAR operator.

Except that's not how Phrase queries work.  Phrase queries are composed of 
Terms, and the slop factor tells how close the terms have to be to each 
other.  So right now, there's no way to search for phrase near 
phrase.  However, you can search for three words in close proximity, by 
creating a Phrase query with a slop of > 1.
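
As a rough illustration of how a slop factor behaves, here is a toy model in 
plain Java. It is a sketch under stated assumptions, not the phrase query 
implementation itself: the class SlopSketch, its method names, and the 
whitespace tokenization are all made up for this example. The model 
subtracts each term's index in the phrase from its position in the text and 
accepts a match when the spread of those adjusted positions is at most the 
slop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of sloppy phrase matching; not library code.
class SlopSketch {

    // Smallest spread of (position - index in phrase) over every way of
    // picking one occurrence per phrase term. Returns -1 when a term is
    // missing. Repeated phrase terms get no special handling here.
    static int minSpread(List<String> doc, List<String> phrase) {
        if (phrase.isEmpty()) {
            return -1;
        }
        List<List<Integer>> adjusted = new ArrayList<>();
        for (int i = 0; i < phrase.size(); i++) {
            List<Integer> positions = new ArrayList<>();
            for (int p = 0; p < doc.size(); p++) {
                if (doc.get(p).equals(phrase.get(i))) {
                    positions.add(p - i);
                }
            }
            if (positions.isEmpty()) {
                return -1;
            }
            adjusted.add(positions);
        }
        return search(adjusted, 0, Integer.MAX_VALUE, Integer.MIN_VALUE);
    }

    // Brute-force search over all combinations; fine for short examples.
    private static int search(List<List<Integer>> adjusted, int term, int min, int max) {
        if (term == adjusted.size()) {
            return max - min;
        }
        int best = Integer.MAX_VALUE;
        for (int a : adjusted.get(term)) {
            best = Math.min(best, search(adjusted, term + 1, Math.min(min, a), Math.max(max, a)));
        }
        return best;
    }

    // True when every phrase term occurs and the best spread fits the slop.
    static boolean matches(String text, String phrase, int slop) {
        int spread = minSpread(words(text), words(phrase));
        return spread >= 0 && spread <= slop;
    }

    // Assumption: lowercase and split on whitespace; no real analyzer.
    private static List<String> words(String s) {
        return Arrays.asList(s.toLowerCase().trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "Mickey Mouse loves Minnie Mouse";
        System.out.println(matches(text, "mickey minnie", 1));
        System.out.println(matches(text, "mickey minnie", 2));
        System.out.println(matches(text, "mickey loves minnie", 1));
    }
}
```

In this model the slop applies to the individual terms of one phrase, which 
is why it covers three words in close proximity but gives no way to say 
that one whole phrase should be near another.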


Re: Parsing of queries.; NEAR queries

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

> We could use
>    a NEAR b
> or
>    a WITHIN N OF b
> but these both have the problem that they don't generalize well to
> phrases with more than two terms.
> 
> Or we could have a (yet another) modifier on the quoted phrase query
> to set the slop --
> 
>    "Mickey Minnie"(5)
> or
>    "Mickey Minnie" SLOP(5)
> 
> Lots of possibilities exist, but so far they're all pretty yucky. 
> Suggestions?

I think I like the "Mickey Minnie"~5 idea from Doug a little better (less
typing)...

Thanks,
Otis


Re: Parsing of queries.; NEAR queries

Posted by ca...@bookandhammer.com.
I guess what I would really like is

"Microsoft Word" NEAR3 "Microsoft Excel"

Where I could combine the phrases together with a NEAR operator.

Usually, at least in my mind, I am searching for words or phrases near 
other words or phrases, not a set of words close to each other.

So I like the idea of having
a NEAR8 b
It could be
a NEAR(8) b, but that seems too programmer-like, and I don't think 
people in general grasp the concept of slop the way they grasp the concept 
of near. So something like "a b c"NEAR(5) is really cryptic.


Oracle interMedia uses NEAR(a,b,#,directional).
Even more cryptic.

Any other thoughts?

--Peter


On Wednesday, January 16, 2002, at 06:54 PM, Brian Goetz wrote:

>
>> doc.add(Field.Text("path", "d=100&a=102"));
>
> I have, finally, fixed the query parser to define query terms by 
> exclusion instead of inclusion.  The terms above, as well as the many 
> other posted examples, should now work.
>
> A side-effect of this is that special tokens, like && for AND, and || 
> for OR, must be separated from the query terms by spaces: if you want 
> a && b, you have to say a && b, not a&&b.  I don't think this should be 
> a problem.
>
> Next up: NEAR.  Everyone wants it, but we're looking for a decent 
> syntax, and many of the good punctuation characters have already been 
> snapped up (like brackets and braces for range queries.)
>
> We could use
>   a NEAR b
> or
>   a WITHIN N OF b
> but these both have the problem that they don't generalize well to 
> phrases with more than two terms.
>
> Or we could have a (yet another) modifier on the quoted phrase query to 
> set the slop --
>
>   "Mickey Minnie"(5)
> or
>   "Mickey Minnie" SLOP(5)
>
> Lots of possibilities exist, but so far they're all pretty yucky. 
> Suggestions?


Re: Parsing of queries.; NEAR queries

Posted by Brian Goetz <br...@quiotix.com>.
>doc.add(Field.Text("path", "d=100&a=102"));

I have, finally, fixed the query parser to define query terms by exclusion 
instead of inclusion.  The terms above, as well as the many other posted 
examples, should now work.

A side-effect of this is that special tokens, like && for AND, and || for 
OR, must be separated from the query terms by spaces: if you want a && b, 
you have to say a && b, not a&&b.  I don't think this should be a problem.
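
To illustrate the exclusion approach, here is a minimal sketch of a toy 
scanner in plain Java. It is not the actual QueryParser grammar: the class 
name ExclusionTokenizerSketch, the token labels, and the reserved character 
set are all assumptions made up for this example. A term is any run of 
characters that are neither whitespace nor reserved, so = and & are ordinary 
term characters, and && only counts as AND when it stands alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy scanner; the real grammar's rules may differ.
class ExclusionTokenizerSketch {

    // Assumed reserved set for illustration only.
    private static final String RESERVED = "+!():^[]{}\"~*?";

    // Terms are defined by exclusion: anything not whitespace or reserved.
    static boolean isTermChar(char c) {
        return !Character.isWhitespace(c) && RESERVED.indexOf(c) < 0;
    }

    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < query.length()) {
            char c = query.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            if (!isTermChar(c)) {
                // Reserved characters become one-character symbol tokens.
                tokens.add("SYM(" + c + ")");
                i++;
                continue;
            }
            int start = i;
            while (i < query.length() && isTermChar(query.charAt(i))) {
                i++;
            }
            String text = query.substring(start, i);
            // && and || are operators only as whole, space-separated tokens.
            if (text.equals("&&")) {
                tokens.add("AND");
            } else if (text.equals("||")) {
                tokens.add("OR");
            } else {
                tokens.add("TERM(" + text + ")");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("d=100&a=102")); // [TERM(d=100&a=102)]
        System.out.println(tokenize("a && b"));      // [TERM(a), AND, TERM(b)]
        System.out.println(tokenize("a&&b"));        // [TERM(a&&b)]
    }
}
```

In this sketch a&&b comes out as a single term because & is no longer 
special inside a run of term characters, which is the same trade-off as the 
side effect described above.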

Next up: NEAR.  Everyone wants it, but we're looking for a decent syntax, 
and many of the good punctuation characters have already been snapped up 
(like brackets and braces for range queries.)

We could use
   a NEAR b
or
   a WITHIN N OF b
but these both have the problem that they don't generalize well to phrases 
with more than two terms.

Or we could have a (yet another) modifier on the quoted phrase query to set 
the slop --

   "Mickey Minnie"(5)
or
   "Mickey Minnie" SLOP(5)

Lots of possibilities exist, but so far they're all pretty yucky. 
Suggestions?