Posted to java-user@lucene.apache.org by Mark Miller <ma...@gmail.com> on 2006/12/06 02:21:57 UTC

New Lucene QueryParser

I have finally delved back into the Lucene query parser that I started a 
few months back. I am very close to wrapping up its initial 
development. I am currently looking for anybody willing to help me out 
with a little testing and maybe some design consultation (I am not happy 
with the current range query syntax, for one). If you have any 
interest in using this parser and have a little time to help out, 
please do. The parser is extremely customizable and you can basically 
mold it into whatever you want. A brief outline of the feature set:

The basics from Lucene query parser are covered: escaping operators, 
handling tokens at the same position, range queries, etc.

Default Operators are: & | ! ~ ( )
New operators can be defined and default operators can be hidden on the fly.

Adds a proximity operator to the standard AND, OR, and ANDNOT operators 
allowing for queries like:
(search bear) ~5 (snake & horse ~4 pope) | crazy query

The operator implied by a space is customizable and can be made to bind 
more tightly than the explicit operator it stands for (it acts like that 
operator, but as if wrapped in parentheses).

The order of operations for the operators is customizable. The default 
order is |, &, ~, !, ( )...you can change it to whatever you want.
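
A minimal sketch of how a configurable order of operations could work, 
not the parser's actual code. It assumes the order string lists operators 
from loosest to tightest binding, that every operator is binary and 
left-associative, and that tokens (including parentheses) are separated 
by whitespace. The class name PrecedenceSketch and its methods are 
hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PrecedenceSketch {
    // Operator character -> rank; a higher rank binds tighter.
    private final Map<Character, Integer> rank = new HashMap<>();

    // Assumption: operators are listed from loosest to tightest binding.
    PrecedenceSketch(String order) {
        for (int i = 0; i < order.length(); i++) {
            rank.put(order.charAt(i), i);
        }
    }

    private boolean isOperator(String token) {
        return !token.isEmpty() && rank.containsKey(token.charAt(0));
    }

    // Shunting-yard: whitespace-separated infix tokens in, postfix out.
    // Expects balanced parentheses.
    String toPostfix(String query) {
        List<String> out = new ArrayList<>();
        Deque<String> ops = new ArrayDeque<>();
        for (String tok : query.trim().split("\\s+")) {
            if (tok.equals("(")) {
                ops.push(tok);
            } else if (tok.equals(")")) {
                while (!ops.isEmpty() && !ops.peek().equals("(")) {
                    out.add(ops.pop());
                }
                ops.pop(); // discard the matching "("
            } else if (isOperator(tok)) {
                // Pop operators that bind at least as tightly (left-associative).
                while (!ops.isEmpty() && !ops.peek().equals("(")
                        && rank.get(ops.peek().charAt(0)) >= rank.get(tok.charAt(0))) {
                    out.add(ops.pop());
                }
                ops.push(tok);
            } else {
                out.add(tok); // plain term
            }
        }
        while (!ops.isEmpty()) {
            out.add(ops.pop());
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(new PrecedenceSketch("|&~!").toPostfix("a | b & c"));
        System.out.println(new PrecedenceSketch("&|~!").toPostfix("a | b & c"));
    }
}
```

Swapping the order string is the only change needed to regroup the same 
query: with "|&~!" the & binds first, with "&|~!" the | binds first.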

Query-time thesaurus expansion / General token to query expansion : 
Takes advantage of a general find/replace feature, "expand" might map to 
"(expander | expanded)" ... or any other valid syntax. There is also a 
slower RegEx feature so that you can match tokens with a Pattern and 
perform back reference enabled replacements. You can also make the 
replacement behave as an operator...you might map NEAR to ~10 , creating 
a new operator that performs within 10 word proximity searches.
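
The find/replace idea can be sketched roughly as follows. This is an 
illustration only: the class ExpansionSketch, its method names, and the 
"ize" pattern are made up for the example, and tokens are assumed to be 
separated by whitespace.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExpansionSketch {
    // Exact token -> replacement syntax (the fast path).
    private final Map<String, String> literal = new LinkedHashMap<>();
    // Pattern -> replacement with back references (the slower path).
    private final Map<Pattern, String> regex = new LinkedHashMap<>();

    void map(String token, String replacement) {
        literal.put(token, replacement);
    }

    void mapRegex(String pattern, String replacement) {
        regex.put(Pattern.compile(pattern), replacement);
    }

    // Expands each whitespace-separated token; unmapped tokens pass through.
    String expand(String query) {
        StringBuilder out = new StringBuilder();
        for (String tok : query.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(expandToken(tok));
        }
        return out.toString();
    }

    private String expandToken(String tok) {
        String hit = literal.get(tok);
        if (hit != null) {
            return hit;
        }
        for (Map.Entry<Pattern, String> e : regex.entrySet()) {
            Matcher m = e.getKey().matcher(tok);
            if (m.matches()) {
                // $1, $2, ... in the replacement refer to captured groups.
                return m.replaceAll(e.getValue());
            }
        }
        return tok;
    }

    public static void main(String[] args) {
        ExpansionSketch e = new ExpansionSketch();
        e.map("expand", "(expander | expanded)");
        e.map("NEAR", "~10"); // a replacement that behaves as an operator
        e.mapRegex("(\\w+)ize", "($1ize | $1ise)");
        System.out.println(e.expand("expand & dog NEAR cat"));
        System.out.println(e.expand("organize"));
    }
}
```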

Did You Mean feature using the SpellCheck contrib: if you search for 
'date(Aug 3, 1952) & mackine | rabbit' you might get a suggestion of : 
'date(Aug 3, 1952) & machine | rabbit'

Paragraph/sentence proximity search functionality. You can inject tokens 
to specify paragraph and sentence markers and perform SpanNotWithin 
searches for paragraph/sentence proximity searches.
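
To illustrate the marker idea without depending on Lucene, here is a 
small sketch that works on a plain token list. It is not the parser's 
implementation: the marker string "_S_", the class name, and the 
interpretation of "within N sentences" as "at most N markers between the 
two terms" are all assumptions. In Lucene itself the equivalent would 
presumably be assembled from span queries such as SpanNearQuery and 
SpanNotQuery.

```java
import java.util.Arrays;
import java.util.List;

final class SentenceProximitySketch {
    // Hypothetical marker token injected at every sentence boundary.
    static final String SENTENCE_MARKER = "_S_";

    // True if some occurrence of a and some occurrence of b have at most
    // maxBoundaries sentence markers between them.
    static boolean within(List<String> tokens, String a, String b, int maxBoundaries) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) {
                continue;
            }
            for (int j = 0; j < tokens.size(); j++) {
                if (!tokens.get(j).equals(b)) {
                    continue;
                }
                int lo = Math.min(i, j);
                int hi = Math.max(i, j);
                int crossed = 0;
                for (int k = lo + 1; k < hi; k++) {
                    if (tokens.get(k).equals(SENTENCE_MARKER)) {
                        crossed++;
                    }
                }
                if (crossed <= maxBoundaries) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "mark", "ran", SENTENCE_MARKER, "the", "dog", "barked", SENTENCE_MARKER);
        System.out.println(within(doc, "mark", "dog", 0)); // same sentence only
        System.out.println(within(doc, "mark", "dog", 1)); // allow one boundary
    }
}
```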

Customizable date parser.

Everything is pretty much configurable on the fly.

Note that there may be some limitations...but so far this has proved to 
be pretty powerful. I could sure use some testing help making it 
production-ready though. I will be putting a new website up for the 
parser soon. Please send me a note if you can help out at all. When I 
put up the jar you can just run it with java -jar and it will provide a 
console input to enter queries and see the Lucene Query generated.

- Mark Miller





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: New Lucene QueryParser

Posted by Mark Miller <ma...@gmail.com>.

Hey Laurent,

I am actually pretty much ready for a beta/preview release right about 
now. All of the features are in and I am pretty happy with most of the 
work. Over the past month I have been squashing bugs and could certainly 
use as much help as I can get making sure this thing is as perfect as it 
can be. I am currently in the middle of migrating to a new laptop, so I 
may take a couple days to get a distribution jar together with some 
simple documentation, but I plan on doing that as soon as I get a chance.

>
>> Query-time thesaurus expansion / General token to query expansion : 
>> Takes advantage of a general find/replace feature, "expand" might map 
>> to "(expander | expanded)" ... or any other valid syntax. 
> I could use this too. Can it also do the following?
> Right now I have a little utility class which expands special strings 
> (syntax still to be discussed) to all combinations:
> "fest[,e] hypothek[,en,a]"
> -> fest hypothek;fest hypotheken;fest hypotheka;feste hypothek;feste 
> hypotheken;feste hypotheka
>
I require a similar feature, although in the form mark{s es ing} -> 
marks markes marking. Unfortunately, the way I have done it (in the 
JavaCC grammar) is not easily configurable.
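
For illustration, the mark{s es ing} form could also be handled outside 
the grammar with a small helper like the one below. This is only a 
sketch under that assumption; the class and method names are made up, 
and it is not how the JavaCC grammar does it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SuffixExpansionSketch {
    // stem{suffix suffix ...}
    private static final Pattern FORM = Pattern.compile("(\\w+)\\{([^}]*)\\}");

    // "mark{s es ing}" yields one term per suffix; other tokens pass through.
    static List<String> expand(String token) {
        Matcher m = FORM.matcher(token);
        if (!m.matches()) {
            return Collections.singletonList(token);
        }
        List<String> out = new ArrayList<>();
        for (String suffix : m.group(2).trim().split("\\s+")) {
            if (!suffix.isEmpty()) {
                out.add(m.group(1) + suffix);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("mark{s es ing}"));
    }
}
```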

>> Note that there may be some limitations...but so far this has proved 
>> to be pretty powerful
> Would still be good to know the limitations you see right now...
>
I mentioned there might be limitations because I kept running into new 
difficult problems and I just didn't know if something would come up I 
could not get around or if something would be too slow etc. Not to 
mention I am still a little (or a lot depending on who you talk to) wet 
behind the ears. So far I have not run into any limitations. That 
certainly does not mean they don't exist though :) I'm still crossing 
my fingers. My goal is to make this thing as perfect as I can. It's 
basically my new hobby.


- Mark




Re: New Lucene QueryParser

Posted by Laurent Hoss <l....@netbreeze.ch>.

Hi Mark

As said in a previous mail, I'm very interested in your parser and I'm 
happy to hear you made progress and implemented
paragraph/sentence proximity search functionality. :)
This is the killer feature for me!
And if the execution of the resulting query (a mix containing 
SpanQuery instances) is not (much) slower than using Boolean/PhraseQuery 
combos, it would allow me to drop our current "1 lucene-doc per 
paragraph" indexing model.
But the other features are also very cool, like the date parsing, which I 
strongly miss in the standard QueryParser!

So let me know when you have a version ready to be tested, and how I 
can help.

-Laurent

PS: Some other notes
> Query-time thesaurus expansion / General token to query expansion : 
> Takes advantage of a general find/replace feature, "expand" might map 
> to "(expander | expanded)" ... or any other valid syntax. 
I could use this too. Can it also do the following?
Right now I have a little utility class which expands special strings 
(syntax still to be discussed) to all combinations:
"fest[,e] hypothek[,en,a]"
-> fest hypothek;fest hypotheken;fest hypotheka;feste hypothek;feste 
hypotheken;feste hypotheka
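
A rough sketch of such an expansion, written only to make the example 
above concrete. It is not the utility class mentioned here: the class 
name and the exact reading of the bracket syntax (comma-separated 
suffixes, an empty entry meaning "no suffix") are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CombinationExpansionSketch {
    // stem, optionally followed by [suffix,suffix,...]
    private static final Pattern WORD =
            Pattern.compile("([^\\[\\]\\s]+)(?:\\[([^\\]]*)\\])?");

    // Builds every combination of word variants, left to right.
    static List<String> expand(String input) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (String word : input.trim().split("\\s+")) {
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String variant : variantsOf(word)) {
                    next.add(prefix.isEmpty() ? variant : prefix + " " + variant);
                }
            }
            results = next;
        }
        return results;
    }

    // "fest[,e]" -> fest, feste; a word without brackets yields itself.
    static List<String> variantsOf(String word) {
        Matcher m = WORD.matcher(word);
        if (!m.matches() || m.group(2) == null) {
            return Collections.singletonList(word);
        }
        List<String> out = new ArrayList<>();
        // limit -1 keeps empty entries, so "[,e]" includes the bare stem
        for (String suffix : m.group(2).split(",", -1)) {
            out.add(m.group(1) + suffix);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(";", expand("fest[,e] hypothek[,en,a]")));
    }
}
```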

> Note that there may be some limitations...but so far this has proved 
> to be pretty powerful
Would still be good to know the limitations you see right now...




Re: New Lucene QueryParser

Posted by Mark Miller <ma...@gmail.com>.

>
> Looks like interesting stuff Mark, but why did you make everything so
> configurable (syntax-wise)?  IMO, there is a lot of value to
> standards, and doing things like changing the precedence of operators
> isn't necessarily a good thing :-)
>
I made it so configurable because I needed to implement a certain query 
language at work, but I think that the language is not that great. I 
don't like most of the choices in it. I needed something though, and it 
was going to require a lot of work...not only did we need arbitrary 
mixing of boolean and proximity operators, but we needed the sentence 
and paragraph proximity as well as the thesaurus expansion. We also have 
many people who ask for one-offs that only apply to their setup, like 
NEAR being an operator that is really within 10. All of this was not 
something I could guarantee that I could do (I just entered the 
workforce), and I certainly didn't have time at work with everything 
else I needed to do for this project I am working on. I wasn't going to 
put so much free time into a parser that I did not like though. So I 
made it very configurable so that it could be configured into the parser 
I needed while still being the parser I wanted.

> Did you ever get a chance to look at Paul's surround language? (I've
> never had the chance to dive into it myself)
>
I have looked into Paul's parser and it is a very nice piece of work. 
Unfortunately, I needed to duplicate a very specific syntax. Also, 
Paul's parser would not give me sentence and paragraph proximity or 
*arbitrary* connecting of boolean and prox operators. That brings 
me back to why this is so configurable: a big reason is to be able to 
simulate a syntax that a customer may be familiar with and want 
retained. I think that order of operations should be standard too, but I 
see no problem with the standard at someone's site being different than 
the standard I use for another site. Some people may want/need proximity 
to bind tighter than ANDNOR, while others might need/want the reverse. 
Being too configurable has its drawbacks, but I am attempting to 
create an alternative parser, not a QueryParser replacement. Choose the 
best weapon for the job ;)

>> Query-time thesaurus expansion / General token to query expansion :
>> Takes advantage of a general find/replace feature, "expand" might map to
>> "(expander | expanded)" ... or any other valid syntax.
>
> The QueryParser does this instead of TokenFilters?
> Is it based on static configuration?
>
I do not use TokenFilters, as they do not fit my requirements (I think). 
Right now, a HashMap is used to map a token to replacement syntax. A 
queryparser is generated from a parserfactory. The parserfactory takes a 
configuration class. When you get a queryparser from the factory you can 
choose to inherit the config from the factory, or you can just set the 
options and configuration directly on the parser. I did this because I 
need a base configuration for a common syntax that individual 
accounts can then tweak to their needs.
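
The factory arrangement described above might look something like this 
sketch. Every name in it (ParserConfigSketch, Config, Parser, 
ParserFactory, newParser) is hypothetical and chosen for the example; 
the real classes may be shaped quite differently.

```java
import java.util.HashMap;
import java.util.Map;

final class ParserConfigSketch {
    // Hypothetical configuration: a token expansion map plus one option.
    static final class Config {
        final Map<String, String> expansions = new HashMap<>();
        String spaceOperator = "&";

        Config copy() {
            Config c = new Config();
            c.expansions.putAll(expansions);
            c.spaceOperator = spaceOperator;
            return c;
        }
    }

    static final class Parser {
        final Config config;

        Parser(Config config) {
            this.config = config;
        }
    }

    static final class ParserFactory {
        private final Config base;

        ParserFactory(Config base) {
            this.base = base;
        }

        // inherit = true: start from a copy of the factory's base config;
        // inherit = false: start from defaults and configure the parser directly.
        Parser newParser(boolean inherit) {
            return new Parser(inherit ? base.copy() : new Config());
        }
    }

    public static void main(String[] args) {
        Config base = new Config();
        base.expansions.put("NEAR", "~10");
        ParserFactory factory = new ParserFactory(base);

        Parser account = factory.newParser(true);
        account.config.expansions.put("ADJ", "~1"); // per-account tweak
        System.out.println(account.config.expansions);
        System.out.println(base.expansions); // base config is left untouched
    }
}
```

Copying the base configuration, rather than sharing it, is one way to let 
an individual account tweak its parser without affecting the others.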

The queryparser is a two pass system. The first pass does not 
tokenize...it does query expansion and preps the suggested query (the 
suggested query must be suggested in the syntax the query was typed in, 
and without expansion). I had worried about speed when I made the 2 pass 
decision, but it has allowed me great flexibility, and with my testing 
so far I have had 0 speed problems.
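
As a rough picture of what the first pass produces, the sketch below 
walks the raw query once and builds two strings: the expanded query for 
the second pass, and a suggestion that keeps the syntax as typed. It is 
an assumption-laden illustration: the class name is made up, tokens are 
split on whitespace, and the corrections map merely stands in for 
whatever the spell checker would return.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FirstPassSketch {
    final String expanded;   // input to the second (parsing) pass
    final String suggestion; // "did you mean" text, in the syntax as typed

    private FirstPassSketch(String expanded, String suggestion) {
        this.expanded = expanded;
        this.suggestion = suggestion;
    }

    // No analysis or tokenizing beyond whitespace splitting happens here.
    static FirstPassSketch run(String query,
                               Map<String, String> expansions,
                               Map<String, String> corrections) {
        List<String> exp = new ArrayList<>();
        List<String> sug = new ArrayList<>();
        for (String tok : query.trim().split("\\s+")) {
            exp.add(expansions.getOrDefault(tok, tok));
            // The suggestion is built from the original token, never the expansion.
            sug.add(corrections.getOrDefault(tok, tok));
        }
        return new FirstPassSketch(String.join(" ", exp), String.join(" ", sug));
    }

    public static void main(String[] args) {
        Map<String, String> expansions = new HashMap<>();
        expansions.put("expand", "(expander | expanded)");
        Map<String, String> corrections = new HashMap<>();
        corrections.put("mackine", "machine"); // stand-in for a spell checker

        FirstPassSketch r = run("expand & mackine | rabbit", expansions, corrections);
        System.out.println(r.expanded);
        System.out.println(r.suggestion);
    }
}
```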

By the way, I have recently tested the paragraph/sentence proximity 
searching (mark within 4 sentences of dog) on a 300k doc index (docs 
8-20k) and the perceived speed was as fast as a normal one- or two-word 
boolean search (not a very scientific test :))

A problem with the paragraph/sentence proximity search right now is that 
if there is only 1 doc in the index the proximity search will wrap. I am 
sure this can be fixed.

- Mark


Re: New Lucene QueryParser

Posted by Yonik Seeley <yo...@apache.org>.

On 12/5/06, Mark Miller <ma...@gmail.com> wrote:
> I have finally delved back into the Lucene Query parser that I started a
> few months back.

Looks like interesting stuff Mark, but why did you make everything so
configurable (syntax-wise)?  IMO, there is a lot of value to
standards, and doing things like changing the precedence of operators
isn't necessarily a good thing :-)

Did you ever get a chance to look at Paul's surround language? (I've
never had the chance to dive into it myself)

> Query-time thesaurus expansion / General token to query expansion :
> Takes advantage of a general find/replace feature, "expand" might map to
> "(expander | expanded)" ... or any other valid syntax.

The QueryParser does this instead of TokenFilters?
Is it based on static configuration?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
