Posted to dev@lucene.apache.org by Paul Elschot <pa...@xs4all.nl> on 2004/04/18 14:51:45 UTC
Surround query parser
Dear developers,
I'd like to contribute a query parser named Surround.
The implementation uses mostly Lucene's BooleanQuery, TermQuery,
SpanNearQuery, SpanOrQuery and SpanTermQuery. These are chosen
depending on the query operator.
Currently the sources are in a CVS working copy next to a lucene
working copy. There is some test code which uses the latest
lucene jar generated from the lucene working copy.
The source code has cooled down far enough for a
package restructuring. In case there is interest, how would
the sources best be structured? Currently two packages are
used for the sources: org.surround.queryparser and
org.surround.search.
Following the name of org.apache.lucene.wordnet in the sandbox,
would org.apache.lucene.surround be ok?
Regards,
Paul
P.S.:
Surround consists of these operators (uppercase/lowercase):
AND/OR/NOT/nW/nN as infix and
AND/OR/nW/nN as prefix.
Distance operators W and N have default n=1, max 99.
Implemented as ordered/unordered SpanQuery with slop = (n - 1).
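The mapping from a distance operator to span query parameters can be sketched as follows. This is only an illustration and not taken from the Surround sources: the class name DistanceOperator, the regular expression, and the rejection of a distance of 0 are assumptions. The resulting slop and ordered flag are what one would presumably pass to Lucene's SpanNearQuery constructor.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the Surround sources. */
final class DistanceOperator {
    // Optional distance of one or two digits (so at most 99), then W or N in either case.
    private static final Pattern OPERATOR = Pattern.compile("([0-9]{1,2})?([wWnN])");

    final int distance;    // n, defaults to 1
    final boolean ordered; // true for W (ordered), false for N (unordered)

    private DistanceOperator(int distance, boolean ordered) {
        this.distance = distance;
        this.ordered = ordered;
    }

    /** Slop for the span query: n - 1. */
    int slop() {
        return distance - 1;
    }

    static DistanceOperator parse(String text) {
        Matcher m = OPERATOR.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a distance operator: " + text);
        }
        int n = (m.group(1) == null) ? 1 : Integer.parseInt(m.group(1));
        if (n < 1) {
            // Assumption: a distance of 0 is not meaningful and is rejected.
            throw new IllegalArgumentException("Distance must be at least 1: " + text);
        }
        return new DistanceOperator(n, m.group(2).equalsIgnoreCase("w"));
    }

    public static void main(String[] args) {
        DistanceOperator op = DistanceOperator.parse("20N");
        System.out.println("ordered=" + op.ordered + " slop=" + op.slop());
    }
}
```

With this sketch, "20N" gives an unordered query with slop 19, and a bare "w" gives an ordered query with slop 0.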
An example prefix form is:
20N(aa*, bb*, cc*)
The name Surround was chosen because of this prefix form
and because it uses the newly introduced span queries
to implement the proximity operators.
The operators and their prefix and infix
forms were borrowed from the user documentation of
various other query languages on the internet.
Wildcards/truncations are the same as in the
Lucene standard query parser:
* for internal and suffix truncation,
? to match one character.
And there is:
^ for boosting a term or a bracketed subquery.
Some examples (best read with fixed size font):
aa
aa and bb
aa and bb or cc          same effect as: (aa and bb) or cc
aa NOT bb NOT cc         same effect as: (aa NOT bb) NOT cc
and(aa,bb,cc)            aa and bb and cc
99w(aa,bb,cc)            ordered span query with slop 98
99n(aa,bb,cc)            unordered span query with slop 98
20n(aa*,bb*)
3w(a?a or bb?, cc*)      W subqueries: OR, truncation
title: text: aa
title : text : aa or bb
title:text: aa not bb
title:aa not text:bb
cc 3w dd                 infix: dual.
cc N dd N ee             same effect as: (cc N dd) N ee
text: aa 3n bb           same effect as: text: (aa 3n bb)
Development status
Not tested: multiple fields, internally mapped to OR queries.
Suffix truncation is implemented very similarly to Lucene's PrefixQuery.
Wildcards (? and internal *) are implemented with regular expressions
to allow further variations. A reimplementation using Lucene's
WildCardTermEnum (correct name?) should be no problem.
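The regular expression approach to wildcards can be sketched roughly like this. It is not the Surround code: the class name TruncationRegex and its methods are made up for illustration, and it assumes that * matches zero or more characters and ? exactly one. It also uses StringBuilder and Pattern.quote, which need a Java version later than 1.4.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the Surround sources. */
final class TruncationRegex {

    /** Translates a truncated term such as "a?a" or "b*d" into a regular expression. */
    static Pattern toPattern(String truncated) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < truncated.length(); i++) {
            char c = truncated.charAt(i);
            if (c == '*') {
                regex.append(".*");   // assumption: zero or more characters
            } else if (c == '?') {
                regex.append('.');    // assumption: exactly one character
            } else {
                // Quote everything else so that regex metacharacters stay literal.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** True when the whole index term matches the truncated term. */
    static boolean matches(String truncated, String term) {
        return toPattern(truncated).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("a?a", "aba"));
        System.out.println(matches("b*d", "bcccd"));
    }
}
```

In a real implementation the pattern would be compiled once per query term and applied to each candidate term from the index, rather than recompiled per comparison as the matches() convenience method does here.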
There is a warning for ordered subqueries with 3 or more subqueries,
due to a pending bug in the ordered SpanNearQuery.
Warnings about missing terms are sent to System.out; this might
be replaced by another stream.
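One way to make the warning stream replaceable is to pass a PrintStream in, with System.out as the default. The sketch below is an assumption about how that could look; the class MissingTermWarnings and the message text are hypothetical and do not come from the Surround sources.

```java
import java.io.PrintStream;

/** Hypothetical helper, not part of the Surround sources. */
final class MissingTermWarnings {
    private final PrintStream out;

    /** Keeps the current behaviour: warnings go to System.out. */
    MissingTermWarnings() {
        this(System.out);
    }

    /** Lets the caller redirect warnings, for example to System.err or a log file. */
    MissingTermWarnings(PrintStream out) {
        this.out = out;
    }

    void missingTerm(String field, String term) {
        out.println("Warning: no terms found in field " + field + " for " + term);
    }

    public static void main(String[] args) {
        new MissingTermWarnings(System.err).missingTerm("text", "zz*");
    }
}
```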
There are no javadoc comments.
I'm using Java 1.4.2, so there are probably some dependencies
on Java 1.4.
Other tools used: ant 1.6b2 and javacc 3.2.
The build target javacc should be used explicitly
when the .jj file is changed.
The sources, apart from a build.xml file:
... src/java/org/surround/search> wc *.java ../q*/*.jj | sort -r
 1424  4322 40776 total
  436  1404 11140 ../queryparser/QueryParser.jj
  138   484  4582 SpanNearClauseFactory.java
  106   316  3359 DistanceQuery.java
  101   266  2860 ComposedQuery.java
   96   245  2480 SrndTruncQuery.java
   95   266  2994 SimpleTerm.java
   78   245  2390 FieldsQuery.java
   72   218  2044 SrndQuery.java
   60   151  1613 SrndPrefixQuery.java
   52   132  1378 SrndTermQuery.java
   49   158  1446 BasicQueryFactory.java
   46   130  1412 OrQuery.java
   31    80   826 SrndBooleanQuery.java
   22    79   866 NotQuery.java
   16    54   569 AndQuery.java
   15    59   512 DistanceSubQuery.java
   11    35   305 TooManyBasicQueries.java
And the test code:
... /src/test/org/surround/search> wc *.java | sort -r
  550  1963 16899 total
  203   875  6761 Test03Distance.java
  105   444  3582 Test02Boolean.java
   97   272  2805 BooleanQueryTest.java
   55   144  1528 ExceptionQueryTest.java
   51   121  1072 Test01Exceptions.java
   39   107  1151 SingleFieldTestDb.java