You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Mark Miller <ma...@gmail.com> on 2006/08/21 21:39:25 UTC

Test new query parser?

Is anyone interested in helping me test out a new query parser (i.e is
anyone interested in using this, thereby helping me test it) ?

 The parser uses a intermediate parse tree representation, unlike Lucene's
Query Filter.


The syntax:

date[april 6, 1992] & field2,field3[parrot ~3s yore] | ((cat | horse) &
rabbit ~6 pete)

'&' is an 'and'


'|' is an 'or'


~2 is a 'within 2 words'


~5p is a 'within 5 paragraphs'


~6s is a 'within 6 sentences'


'!' is 'butnot'


' ' is an 'and' that binds tighter than anything else but can only connect
search tokens (i.e 'old man' not 'date[today] man'--that would be
date[today] & man...changeable of course:_)


field1, field2, field3[mark | butternut] performs a search on all 3 fields


date[20050601 to 20060504] uses constant scoring range query filter to date
search (can parse most date formats).


no arbitrary range search yet...

paragraph/sentence support requires a replacement (supplied) standard
analyzer and is optional.
there is also support for a "did you mean" feature that spits back the query
substituted with a did you mean guess based on the words in a supplied index
(the corpus or a dictionary maybe):


field1,field2[horke | tomcat] might return a suggested search of =
field1,field2[horse | tomcat]

I am about to add quoted searches as well as the ability to escape query
keywords.

more feature to come, there is a lot I want to change and add (and fix)...

if you are interested, drop me a line at markrmiller@gmail.com

- mark

Some sample proximity queries:

        example = "monkey fowl ~3 man ~5 horse head ~4 lamb";
        expected = "+(+spanNear([allFields:monkey, allFields:man], 3, false)
+spanNear([allFields:fowl, allFields:man], 3, false))
+(+spanNear([allFields:man, allFields:horse], 5, false)
+spanNear([allFields:man, allFields:head], 5, false)
+(+spanNear([allFields:horse, allFields:lamb], 4, false)
+spanNear([allFields:head, allFields:lamb], 4, false)))";
        assertEquals(expected, parse(example));
//
        example = "monkey ~3 man";
        expected = "spanNear([allFields:monkey, allFields:man], 3, false)";
        assertEquals(expected, parse(example));

        example = "monkey ord~3 man";
        expected = "spanNear([allFields:monkey, allFields:man], 3, true)";
        assertEquals(expected, parse(example));

        example = "monkey ~3 man ~2 her";
        expected = "+spanNear([allFields:monkey, allFields:man], 3, false)
+spanNear([allFields:man, allFields:her], 2, false)";
        assertEquals(expected, parse(example));

        example = "(fowl & helicopter) ~8 hillary";
        expected = "+spanNear([allFields:fowl, allFields:hillary], 8, false)
+spanNear([allFields:helicopter, allFields:hillary], 8, false)";
        assertEquals(expected, parse(example));

        example = "(fowl | helicopter) ~6 hillary";
        expected = "+spanNear([allFields:fowl, allFields:hillary], 8, false)
+spanNear([allFields:helicopter, allFields:hillary], 8, false)";
        assertEquals(expected, parse(example));
//
//        // butnot resolves before proximity search
//        example = "(cop | fowl) & (fowl & priest man) ! helicopter ~8
hillary";
//        expected = "+(allFields:cop allFields:fowl)
+(+spanNear([allFields:fowl, allFields:hillary], 8, false)
+spanNear([allFields:priest, allFields:hillary], 8, false)
+spanNear([allFields:man, allFields:hillary], 8, false)
-spanNear([allFields:helicopter, allFields:hillary], 8, false))";
//        assertEquals(expected, parse(example));
//
        example = "priest man ! helicopter ~8 hillary";
        expected = "+spanNear([allFields:priest, allFields:hillary], 8,
false) +spanNear([allFields:man, allFields:hillary], 8, false)
-spanNear([allFields:helicopter, allFields:hillary], 8, false)";
        assertEquals(expected, parse(example));

Re: Test new query parser?

Posted by Mark Miller <ma...@gmail.com>.
Great, I will get something ready to be given out within a day or so then.

Paragraph/Sent prox support is one thing I really need to test and improve.

The parapraph and sentence search uses a SpanWithinQuery. This is just a
SpanNotQuery that can span a specified number of times instead of not at
all.

mark ~3p dopey

SpanWithinQuery(spanNear([allFields:mark, allFields:dopey], 99999, false),
3, allFields:¶)

>
> mark ord~3p dopey

SpanWithinQuery(spanNear([allFields:mark, allFields:dopey], 99999, true), 3,
allFields:¶)

I use 99999 terribly arbitrarily. Integer.max_value blows things up.

This example uses ¶ as a marker. I think a better default might be a double
newline of some kind. Unfortunately, the marker is not nicely configurable
without source code access because it must be defined in the .jj file of the
modified standard analyzer. Sentences can be deduced more universally:

Sentences are matched with:

<SENTENCE: ([".","?","!","]","[","]","\"","'",")"])+ ([" ","\t","\r","\n"])+
(["[","\"","'","`","("])* >

which is seems to be the standard sentence finder, although it lacks the
ability to check that an alphanumeric comes next (so that it doesn't mark a
sentence at the end of the text). No biggie for now though.

Also, right now paragraphs and sentences are marked separately at all
spots...for compactness a paragraph marker should also represent a sentence
marker. I need to check speed on these things though.

With some others interested in this I will get a move on these issues
though. I really want some feedback to make this thing better. It is not
perfect but I believe it has potential. Adding the other span type query for
example, would not be very difficult. The syntax is pretty easily modified
and extended

- Mark

Re: Test new query parser?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 21, 2006, at 3:39 PM, Mark Miller wrote:
> Is anyone interested in helping me test out a new query parser (i.e is
> anyone interested in using this, thereby helping me test it) ?

I'm definitely interested in giving it a try.  The syntax looks nice.

> ~5p is a 'within 5 paragraphs'
>
> ~6s is a 'within 6 sentences'

What defines paragraphs and sentences?   What types of queries do  
those resolve to?

Thanks,
	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Test new query parser?

Posted by Mark Miller <ma...@gmail.com>.
It is interesting to note that Lucene would also seem to suffer from 
bugs when using spans if you only have a single document in the index. 
At least with the NotSpanQuery, the spans could wrap around the document 
from end to beginning. This would be unexpected but would also go away 
if you add a second doc. This is a limitation (if you could call it 
that) of the spans implementation and not a bug though (I'm speculating 
on all of this).

Hopefully my new idea will sort this problem out. This is a non-work 
project so I don't have much time for it, but it is much more 
interesting than my work so I will probably keep plugging away on off hours.

The new idea is to collect the terms in a composite object that will 
maintain order of operations as well as the tokens.

I have much to do on this composite class, but this query seems to 
generate an acceptable composite:

(me | cop & her) ~2 (old & women) :

cdistrib(cdistrib(distrib(allFields:me)' 
'cdistrib(distrib(allFields:cop)'+'distrib(allFields:her)))) ~3 
cdistrib(distrib(allFields:old)'+'distrib(allFields:women))

Then I will need a function that will distribute one composite across 
another, creating the proximity query (I would guess a boolean at its 
root). I am confident I can make the composite class work, but I have 
not investigated how hard it will be to do the distribution. All of the 
information is there though, so I am assuming it won't be too difficult 
(don't i wish).

- Mark
> Mark --
>
> Don't lose hope! We are migrating from Verity to Lucene, and I know for
> a fact that we will have to support the kind of complex queries you talk
> about; maybe not /quite/ as complex as your magnificent:
>
>> cop | fowl & (fowl | priest & man) ! helicopter ~8 (hillary | tom)
>
> but certainly more complex than I have been able to solve.
>
> We can also take heart in that Verity hasn't quite cracked it either. In
> trying to see exactly what I needed to support I was doing some
> experiments against known data and discovered that Verity has some
> parsing bugs that do not reveal themselves easily. For example:
>
>   title:  "get me out of here, please"
>   queryA: title:((here OR there) NEAR/2 please)
>   queryB: title:(there OR here) NEAR/2 please)
>
> In theory both queries should find the example title, but only queryB
> works with Verity, so something is wrong.
>
> -- Robert
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Test new query parser?

Posted by Mark Miller <ma...@gmail.com>.
I have received a few inquires about my new query parser. I apologize 
for making that announcement a little premature. My current 
implementation only allows simple mixing of proximity queries with 
boolean queries...complex mixing would result in an incorrect search. A 
reply to my first email made me consider this more (I had done that part 
a while ago) and I came to the conclusion that it was obviously 
unacceptable to release the parser to anyone in this hobbled form. The 
parser must support arbitrary mixing of boolean and proximity searchers.

I think I have cracked this. I would say I am at 90%  of the way to a 
solution and can see the light at the end of the tunnel. When I have 
resolved this issue, I will contact those that have expressed interest 
and provide them with the parser. With some feedback and improvements I 
will think about how to release it generally.

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Test new query parser?

Posted by Robert Watkins <rw...@foo-bar.org>.
Mark --

Don't lose hope! We are migrating from Verity to Lucene, and I know for
a fact that we will have to support the kind of complex queries you talk
about; maybe not /quite/ as complex as your magnificent:

> cop | fowl & (fowl | priest & man) ! helicopter ~8 (hillary | tom)

but certainly more complex than I have been able to solve.

We can also take heart in that Verity hasn't quite cracked it either. In
trying to see exactly what I needed to support I was doing some
experiments against known data and discovered that Verity has some
parsing bugs that do not reveal themselves easily. For example:

   title:  "get me out of here, please"
   queryA: title:((here OR there) NEAR/2 please)
   queryB: title:(there OR here) NEAR/2 please)

In theory both queries should find the example title, but only queryB
works with Verity, so something is wrong.

-- Robert

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Test new query parser?

Posted by Mark Miller <ma...@gmail.com>.
You are not kidding. I keep finding this tougher and tougher. Originally 
my method worked for most simple queries, but not all. I improved it to 
cover some more ground, allowing some modestly complex queries, but now 
even that improvement seems woefully inadequate for solving the general 
case. I was handling some pretty complex queries, but losing order of 
operations in a proximity query if things got too exciting. Among other 
small bugs.

The parse tree handles the boolean stuff fine on its own, but the 
proximity seems to require a distributed attack (think algebra) that 
still maintains order of operations. It is a bit nasty.

consider:
cop | fowl & (fowl | priest & man) ! helicopter ~8 (hillary | tom)

this must distribute to (roughly)
cop | ( (fowl & ( (fowl ~8 hillary | (priest ~8 hillary & man ~8 
hillary) ) ! helicopter ~8 hillary) | (fowl ~8 tom | (priest ~8 tom & 
man ~8 tom) ) ! helicopter ~8 tom))

I have gotten close but instead of (fowl | (priest & man)) I might get 
((fowl | priest) & man)...a naive distribution will ignore order of ops 
and order by left to right.

my order of ops:
& "and"
| "or"
~ "within"
! "butnot"
<space between words>

I have a new plan of attack that I have begun, but who knows where it 
will lead. I thought I was so close, but apparently just a tease...that 
method could only take me so far. I hate to put so much work into this 
since I doubt anyone will even use such complex queries (the queries I 
monitor are always so basic) but I may give it a go just to see if my 
new idea will solve the general case.

We will see if this parser actually has any life in it. Maybe I am no 
closer than you where-- I am very new at this.

- Mark
> Mark --
>
> Yes please! I'm very interested in the mixing of boolean and proximity
> operators. I have also worked on a parser (using JavaCC) but haven't
> managed to crack queries such as:
>
>     ((a OR b) AND c) NEAR (d NOT e)
>
> I can get the parse tree okay, but haven't figured out how to translate
> that into a valid Lucene Query object. Simple queries such as:
>
>     (a OR b) NEXT (c OR d)   // note the use of OR exclusively!
>
> are okay, but nothing more complex. So: bring it on!
>
> -- Robert
> rwatkins at foo-bar.org
>
> On Mon, 21 Aug 2006, Mark Miller wrote:
>
>> Is anyone interested in helping me test out a new query parser (i.e is
>> anyone interested in using this, thereby helping me test it) ?
>>
>> The parser uses a intermediate parse tree representation, unlike 
>> Lucene's
>> Query Filter.
>>
>> [ snipped ]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Test new query parser?

Posted by Robert Watkins <rw...@foo-bar.org>.
Mark --

Yes please! I'm very interested in the mixing of boolean and proximity
operators. I have also worked on a parser (using JavaCC) but haven't
managed to crack queries such as:

     ((a OR b) AND c) NEAR (d NOT e)

I can get the parse tree okay, but haven't figured out how to translate
that into a valid Lucene Query object. Simple queries such as:

     (a OR b) NEXT (c OR d)   // note the use of OR exclusively!

are okay, but nothing more complex. So: bring it on!

-- Robert
rwatkins at foo-bar.org

On Mon, 21 Aug 2006, Mark Miller wrote:

> Is anyone interested in helping me test out a new query parser (i.e is
> anyone interested in using this, thereby helping me test it) ?
>
> The parser uses a intermediate parse tree representation, unlike Lucene's
> Query Filter.
>
> [ snipped ]

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org