Posted to java-user@lucene.apache.org by John Byrne <jo...@propylon.com> on 2007/10/22 13:31:29 UTC
Queries spanning paragraphs
Hi all,
I need to match documents in which two terms occur within n paragraphs
of each other. I had a look through the archives, and although many
people have explained ways to implement per-sentence or per-paragraph
indexing and searching, no one seems to have tackled this one yet.
The only idea I can come up with is this:
I will index the entire document as normal, but also index the
paragraphs separately, numbering them according to the order in which
they occur (storage space isn't an issue). When searching, I will first
find all documents that contain both terms, using the full-content field.
Then I can fetch all the paragraphs that belong to each such document and
contain either of the search terms. I would still have to implement a bit
of logic to check which paragraphs contain which term, and check the
distance between them (from the order info I kept when indexing).
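The distance check described above can be sketched in plain Java. This is only an illustration, not Lucene code: the class name ParagraphDistance and the method withinParagraphs are hypothetical, and the two sorted arrays stand in for the paragraph numbers that would be read back from the per-paragraph index for each term.

```java
public class ParagraphDistance {

    /**
     * Returns true if some paragraph containing term A and some paragraph
     * containing term B are at most n paragraphs apart.
     * Both arrays must be sorted in ascending order.
     */
    public static boolean withinParagraphs(int[] parasA, int[] parasB, int n) {
        int i = 0;
        int j = 0;
        // Walk both sorted lists together, always advancing the smaller value.
        while (i < parasA.length && j < parasB.length) {
            int diff = parasA[i] - parasB[j];
            if (Math.abs(diff) <= n) {
                return true;
            }
            if (diff < 0) {
                i++;
            } else {
                j++;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] termA = {2, 9};   // paragraphs containing the first term
        int[] termB = {5, 14};  // paragraphs containing the second term
        System.out.println(withinParagraphs(termA, termB, 3)); // true
        System.out.println(withinParagraphs(termA, termB, 2)); // false
    }
}
```

Because both lists are sorted, the check is linear in the number of matching paragraphs, so the expensive part of this approach is likely to be the extra per-paragraph lookups rather than the comparison itself.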
I'm sure this would work, but it would be very slow. I can't help
feeling there's a better solution, that might involve inserting
paragraph tags into the content in a special field in my index, and
somehow using SpanQueries to find matches that have a given number of
paragraph marks in between... but I don't know if that's possible.
Does anyone have any ideas?
Thanks!
John B.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Queries spanning paragraphs
Posted by Mark Miller <ma...@gmail.com>.
It is stable...give it a whirl. I use it at about 5 or 6 different
heavily used installs at the moment and know of about a dozen others
that use it (many others have downloaded, but who knows what for). If
you notice anything off with it, I will fix immediately as I use it
heavily in production environments (newspaper business). Just send me
the query that does not work as you would expect.
It is under the Apache License. If you have any comments and/or requests,
I am currently working on a second version and am happy to receive any feedback.
Qsol was started a little over a year ago and the first beta release was
in January. 1.0 was released in June. An early version was even
translated to C# by a guy that needed it. Because of some heavy changes
I plan, I am now moving on to 2.0.
Do keep in mind though...the Highlighter extension I wrote does not
require Qsol: https://issues.apache.org/jira/browse/LUCENE-794, though
it does complement the span support nicely.
Also, I believe the default Lucene QueryParser does have limited
configurable operators.
- Mark
John Byrne wrote:
> Thanks for that, that's exactly what I needed.
>
> Actually, I hadn't heard of qsol, but it seems to solve a few other
> problems I have as well - correct highlighting, configurable
> operators, sentence recognition. Is it distributed under the Apache
> license? And is it currently stable enough to use out-of-the-box?
>
> Cheers,
> John
Re: Queries spanning paragraphs
Posted by John Byrne <jo...@propylon.com>.
Thanks for that, that's exactly what I needed.
Actually, I hadn't heard of qsol, but it seems to solve a few other
problems I have as well - correct highlighting, configurable operators,
sentence recognition. Is it distributed under the Apache license? And is
it currently stable enough to use out-of-the-box?
Cheers,
John
Re: Queries spanning paragraphs
Posted by Mark Miller <ma...@gmail.com>.
I implemented this for my qsol query parser: myhardshadow.com/qsol
It uses a modified SpanNotQuery that takes an extra parameter specifying
how many times the span may cross a given marker. Index a special
paragraph marker with your text to delimit paragraphs and then the rest
is easy.
- Mark
import java.io.IOException;
import java.util.Collection;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.util.ToStringUtils;

public class SpanWithinQuery extends SpanQuery {
    private SpanQuery include;
    private SpanQuery exclude;
    private int proximity;

    /** Construct a SpanWithinQuery matching spans from <code>include</code> which
     * overlap with spans from <code>exclude</code> up to <code>proximity</code> times.*/
    public SpanWithinQuery(SpanQuery include, SpanQuery exclude, int proximity) {
        this.include = include;
        this.exclude = exclude;
        this.proximity = proximity;

        if (!include.getField().equals(exclude.getField())) {
            throw new IllegalArgumentException("Clauses must have same field.");
        }
    }

    /** Return the SpanQuery whose matches are filtered. */
    public SpanQuery getInclude() {
        return include;
    }

    /** Return the SpanQuery whose matches must not overlap those returned. */
    public SpanQuery getExclude() {
        return exclude;
    }

    public String getField() {
        return include.getField();
    }

    /** Returns a collection of all terms matched by this query.
     * @deprecated use extractTerms instead
     * @see #extractTerms(Set)
     */
    public Collection getTerms() {
        return include.getTerms();
    }

    public void extractTerms(Set terms) {
        include.extractTerms(terms);
    }

    public String toString(String field) {
        StringBuffer buffer = new StringBuffer();
        buffer.append("spanWithin(");
        buffer.append(include.toString(field));
        buffer.append(", ");
        buffer.append(proximity + " ,");
        buffer.append(exclude.toString(field));
        buffer.append(")");
        buffer.append(ToStringUtils.boost(getBoost()));

        return buffer.toString();
    }

    public Spans getSpans(final IndexReader reader) throws IOException {
        return new Spans() {
            private Spans includeSpans = include.getSpans(reader);
            private boolean moreInclude = true;
            private Spans excludeSpans = exclude.getSpans(reader);
            private boolean moreExclude = true;

            public boolean next() throws IOException {
                if (moreInclude) { // move to next include
                    moreInclude = includeSpans.next();
                }

                while (moreInclude && moreExclude) {
                    if (includeSpans.doc() > excludeSpans.doc()) {
                        // skip exclude
                        moreExclude = excludeSpans.skipTo(includeSpans.doc());
                    }

                    // number of exclude (marker) spans counted for this include span
                    int count = 0;

                    while (moreExclude // while exclude is before
                            && (includeSpans.doc() == excludeSpans.doc())) {
                        if (!(excludeSpans.end() <= includeSpans.start())) {
                            count += 1;

                            if (count > proximity) {
                                break;
                            }
                        }

                        moreExclude = excludeSpans.next(); // increment exclude
                    }

                    if (!moreExclude // if no intersection
                            || (includeSpans.doc() != excludeSpans.doc())
                            || (includeSpans.end() <= excludeSpans.start())) {
                        break; // we found a match
                    }

                    moreInclude = includeSpans.next(); // intersected: keep scanning
                }

                return moreInclude;
            }

            public boolean skipTo(int target) throws IOException {
                if (moreInclude) { // skip include
                    moreInclude = includeSpans.skipTo(target);
                }

                if (!moreInclude) {
                    return false;
                }

                if (moreExclude // skip exclude
                        && (includeSpans.doc() > excludeSpans.doc())) {
                    moreExclude = excludeSpans.skipTo(includeSpans.doc());
                }

                int count = 0;

                while (moreExclude // while exclude is before
                        && (includeSpans.doc() == excludeSpans.doc())) {
                    if (!(excludeSpans.end() <= includeSpans.start())) {
                        count += 1;

                        if (count > proximity) {
                            break;
                        }
                    }

                    moreExclude = excludeSpans.next(); // increment exclude
                }

                if (!moreExclude // if no intersection
                        || (includeSpans.doc() != excludeSpans.doc())
                        || (includeSpans.end() <= excludeSpans.start())) {
                    return true; // we found a match
                }

                return next(); // scan to next match
            }

            public int doc() {
                return includeSpans.doc();
            }

            public int start() {
                return includeSpans.start();
            }

            public int end() {
                return includeSpans.end();
            }

            public String toString() {
                return "spans(" + SpanWithinQuery.this.toString() + ")";
            }
        };
    }

    public Query rewrite(IndexReader reader) throws IOException {
        SpanWithinQuery clone = null;

        SpanQuery rewrittenInclude = (SpanQuery) include.rewrite(reader);

        if (rewrittenInclude != include) {
            clone = (SpanWithinQuery) this.clone();
            clone.include = rewrittenInclude;
        }

        SpanQuery rewrittenExclude = (SpanQuery) exclude.rewrite(reader);

        if (rewrittenExclude != exclude) {
            if (clone == null) {
                clone = (SpanWithinQuery) this.clone();
            }

            clone.exclude = rewrittenExclude;
        }

        if (clone != null) {
            return clone; // some clauses rewrote
        } else {
            return this; // no clauses rewrote
        }
    }

    /** Returns true iff <code>o</code> is equal to this. */
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }

        if (!(o instanceof SpanWithinQuery)) {
            return false;
        }

        SpanWithinQuery other = (SpanWithinQuery) o;

        return this.include.equals(other.include)
                && this.exclude.equals(other.exclude)
                && (this.getBoost() == other.getBoost())
                && (proximity == other.proximity);
    }

    public int hashCode() {
        int h = include.hashCode();
        h = (h << 1) | (h >>> 31); // rotate left
        h ^= exclude.hashCode();
        h = (h << 1) | (h >>> 31); // rotate left
        h ^= Float.floatToRawIntBits(getBoost());
        h ^= proximity;

        return h;
    }
}
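To illustrate the marker idea without depending on a particular Lucene version, here is a small plain-Java model of the rule: two terms match if the stretch of text between them crosses the paragraph marker at most a given number of times. The marker token PARA_MARK, the class name MarkerModel, and the blank-line/whitespace tokenization are assumptions made for this sketch. A real index would emit the marker from the analyzer, and this model does not reproduce the exact span iteration of the class above.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkerModel {
    // Hypothetical marker token; choose one that cannot occur in real text.
    static final String PARA_MARK = "para_mark";

    /** Splits paragraphs on blank lines and emits a marker token between them. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String[] paragraphs = text.trim().split("\\n\\s*\\n");
        for (int p = 0; p < paragraphs.length; p++) {
            if (p > 0) {
                tokens.add(PARA_MARK);
            }
            for (String word : paragraphs[p].trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    tokens.add(word.toLowerCase());
                }
            }
        }
        return tokens;
    }

    /** True if some occurrence of a and some occurrence of b have at most
     *  maxCrossings marker tokens between them (brute force, for clarity). */
    static boolean matches(List<String> tokens, String a, String b, int maxCrossings) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) {
                continue;
            }
            for (int j = 0; j < tokens.size(); j++) {
                if (!tokens.get(j).equals(b)) {
                    continue;
                }
                int lo = Math.min(i, j);
                int hi = Math.max(i, j);
                int crossings = 0;
                for (int k = lo + 1; k < hi; k++) {
                    if (tokens.get(k).equals(PARA_MARK)) {
                        crossings++;
                    }
                }
                if (crossings <= maxCrossings) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "apple pie\n\nplain text\n\nbanana bread";
        List<String> tokens = tokenize(doc);
        System.out.println(matches(tokens, "apple", "banana", 2)); // true
        System.out.println(matches(tokens, "apple", "banana", 1)); // false
    }
}
```

In the sample document "apple" and "banana" are separated by two paragraph markers, so the pair matches with a limit of 2 but not with a limit of 1.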