Posted to java-user@lucene.apache.org by John Byrne <jo...@propylon.com> on 2007/10/22 13:31:29 UTC

Queries spanning paragraphs

Hi all,

I need the ability to match documents that have two terms that occur 
within n paragraphs of each other. I had a look through the archives, 
and although many people have explained ways to implement per-sentence 
or per-paragraph indexing & searching, no one seems to have tackled this 
one yet.

The only idea I can come up with is this:

I will index the entire document, as normal, but also index the 
paragraphs separately, numbering them according to the order they occur 
in. (Storage space isn't an issue). When searching, I will first find 
all documents that have both terms, using the full-content field.

Then I can get all the paragraphs that are part of that doc, and have 
either of the search terms. I would still have to implement a bit of 
logic to check which paragraphs have which term, and check the distance 
between them (from the order info I kept when indexing).
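
A minimal sketch of that distance check, assuming the paragraph numbers 
for each term have already been fetched from the index as sorted lists. 
The class and method names are made up for illustration and are not part 
of Lucene:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Lucene class. Inputs are the paragraph numbers
// (stored at index time) of the paragraphs that contain each term.
final class ParagraphDistance {

    /**
     * Smallest gap between a paragraph containing term A and one containing
     * term B, or -1 if either list is empty. Both lists must be sorted ascending.
     */
    static int minDistance(List<Integer> parasWithA, List<Integer> parasWithB) {
        if (parasWithA.isEmpty() || parasWithB.isEmpty()) {
            return -1;
        }
        int i = 0;
        int j = 0;
        int best = Integer.MAX_VALUE;
        while (i < parasWithA.size() && j < parasWithB.size()) {
            int a = parasWithA.get(i);
            int b = parasWithB.get(j);
            best = Math.min(best, Math.abs(a - b));
            if (a < b) {
                i++; // advance whichever side is behind
            } else {
                j++;
            }
        }
        return best;
    }

    /** True if the two terms occur within n paragraphs of each other. */
    static boolean withinParagraphs(List<Integer> a, List<Integer> b, int n) {
        int d = minDistance(a, b);
        return d >= 0 && d <= n;
    }

    public static void main(String[] args) {
        // Term A in paragraphs 2 and 9, term B in paragraphs 5 and 11.
        System.out.println(withinParagraphs(Arrays.asList(2, 9), Arrays.asList(5, 11), 2));
    }
}
```

The check itself is linear in the number of matching paragraphs; the 
expensive part would be the second round of lookups per document.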

I'm sure this would work, but it would be very slow. I can't help 
feeling there's a better solution, that might involve inserting 
paragraph tags into the content in a special field in my index, and 
somehow using SpanQueries to find matches that have a given number of 
paragraph marks in between... but I don't know if that's possible.
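
To make the marker idea concrete, here is a rough plain-Java sketch of 
the indexing side only: it splits text on blank lines and puts a marker 
token between paragraphs. The marker string "__para__" and the class 
name are assumptions for illustration; in a real index this would 
presumably be done inside an Analyzer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical pre-processing step, not a Lucene class.
final class ParagraphMarkers {

    // Assumed marker token; any string that cannot occur in real text would do.
    static final String PARA = "__para__";

    /** Lower-cases and whitespace-tokenizes text, inserting PARA between paragraphs. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Paragraphs are assumed to be separated by one or more blank lines.
        String[] paragraphs = text.trim().split("\\n\\s*\\n");
        for (int p = 0; p < paragraphs.length; p++) {
            if (p > 0) {
                tokens.add(PARA);
            }
            for (String word : paragraphs[p].trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    tokens.add(word.toLowerCase(Locale.ROOT));
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("First paragraph here.\n\nSecond one."));
    }
}
```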

Does anyone have any ideas?

Thanks!
John B.




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Queries spanning paragraphs

Posted by Mark Miller <ma...@gmail.com>.

It is stable...give it a whirl. I use it at about 5 or 6 different 
heavily used installs at the moment and know of about a dozen others 
that use it (many others have downloaded it, but who knows what for). If 
you notice anything off with it, I will fix it immediately, as I use it 
heavily in production environments (newspaper business). Just send me 
the query that does not work as you would expect.

It is under the Apache License. If you have any comments and/or requests, I am 
currently working on a second version and am happy to receive any feedback.

Qsol was started a little over a year ago and the first beta release was 
in January. 1.0 was released in June. An early version was even 
translated to C# by a guy that needed it. Because of some heavy changes 
I plan, I am now moving on to 2.0.

Do keep in mind though...the Highlighter extension I wrote does not 
require Qsol: https://issues.apache.org/jira/browse/LUCENE-794, though 
it does complement the span support nicely.

Also, the default Lucene QueryParser does have a limited set of 
configurable operators, I believe.

- Mark

John Byrne wrote:
> Thanks for that, that's exactly what I needed.
>
> Actually, I hadn't heard of qsol, but it seems to solve a few other 
> problems I have as well - correct highlighting, configurable 
> operators, sentence recognition. Is it distributed under the Apache 
> license? And is it currently stable enough to use out-of-the-box?
>
> Cheers,
> John

Re: Queries spanning paragraphs

Posted by John Byrne <jo...@propylon.com>.

Thanks for that, that's exactly what I needed.

Actually, I hadn't heard of qsol, but it seems to solve a few other 
problems I have as well - correct highlighting, configurable operators, 
sentence recognition. Is it distributed under the Apache license? And is 
it currently stable enough to use out-of-the-box?

Cheers,
John



Re: Queries spanning paragraphs

Posted by Mark Miller <ma...@gmail.com>.

I implemented this for my qsol query parser: myhardshadow.com/qsol

Uses a modified SpanNotQuery that takes another parameter saying how 
many times the span can cross the specified marker. Index a special 
paragraph marker with your text to delimit paragraphs and then the rest 
is easy.

- Mark
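
As a rough illustration of the "how many times the span can cross the 
marker" idea, here is a plain-Java simulation over a token list. It does 
not use Lucene and is not the qsol implementation; the marker string 
"__para__" and all names are assumptions for the sketch:

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of counting marker crossings; not a Lucene class.
final class MarkerCrossing {

    // Assumed paragraph marker token indexed between paragraphs.
    static final String PARA = "__para__";

    /** Counts marker tokens strictly between positions from and to (from <= to). */
    static int markersBetween(List<String> tokens, int from, int to) {
        int count = 0;
        for (int i = from + 1; i < to; i++) {
            if (PARA.equals(tokens.get(i))) {
                count++;
            }
        }
        return count;
    }

    /**
     * True if some occurrence of a and some occurrence of b have at most
     * maxCrossings markers between them. Quadratic, which is fine for a sketch.
     */
    static boolean within(List<String> tokens, String a, String b, int maxCrossings) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!a.equals(tokens.get(i))) {
                continue;
            }
            for (int j = 0; j < tokens.size(); j++) {
                if (!b.equals(tokens.get(j))) {
                    continue;
                }
                int lo = Math.min(i, j);
                int hi = Math.max(i, j);
                if (markersBetween(tokens, lo, hi) <= maxCrossings) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "lucene", "rocks", PARA, "filler", PARA, "spans", "work");
        System.out.println(within(tokens, "lucene", "spans", 2));
        System.out.println(within(tokens, "lucene", "spans", 1));
    }
}
```

Here "lucene" and "spans" are two markers apart, so they match with a 
limit of 2 but not with a limit of 1.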

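// Imports for the Lucene 2.x-era API this class is written against.
import java.io.IOException;
import java.util.Collection;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.util.ToStringUtils;

public class SpanWithinQuery extends SpanQuery {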
 
    private SpanQuery include;
    private SpanQuery exclude;
    private int proximity;

    /** Construct a SpanWithinQuery matching spans from <code>include</code> which
     *  overlap with spans from <code>exclude</code> up to <code>proximity</code> times.*/
    public SpanWithinQuery(SpanQuery include, SpanQuery exclude, int proximity) {
        this.include = include;
        this.exclude = exclude;
        this.proximity = proximity;

        if (!include.getField().equals(exclude.getField())) {
            throw new IllegalArgumentException("Clauses must have same field.");
        }
    }

    /** Return the SpanQuery whose matches are filtered. */
    public SpanQuery getInclude() {
        return include;
    }

    /** Return the marker SpanQuery whose matches may be crossed at most
     *  <code>proximity</code> times. */
    public SpanQuery getExclude() {
        return exclude;
    }

    public String getField() {
        return include.getField();
    }

    /** Returns a collection of all terms matched by this query.
     * @deprecated use extractTerms instead
     * @see #extractTerms(Set)
     */
    public Collection getTerms() {
        return include.getTerms();
    }

    public void extractTerms(Set terms) {
        include.extractTerms(terms);
    }

    public String toString(String field) {
        StringBuffer buffer = new StringBuffer();
        buffer.append("spanWithin(");
        buffer.append(include.toString(field));
        buffer.append(", ");
        buffer.append(proximity).append(", ");
        buffer.append(exclude.toString(field));
        buffer.append(")");
        buffer.append(ToStringUtils.boost(getBoost()));

        return buffer.toString();
    }

    public Spans getSpans(final IndexReader reader) throws IOException {
        return new Spans() {
                private Spans includeSpans = include.getSpans(reader);
                private boolean moreInclude = true;
                private Spans excludeSpans = exclude.getSpans(reader);
                private boolean moreExclude = true;

                public boolean next() throws IOException {
                    if (moreInclude) { // move to next include
                        moreInclude = includeSpans.next();
                    }

                    while (moreInclude && moreExclude) {
                        if (includeSpans.doc() > excludeSpans.doc()) { // skip exclude
                            moreExclude = excludeSpans.skipTo(includeSpans.doc());
                        }

                        int count = 0;

                        while (moreExclude // while exclude is before
                                && (includeSpans.doc() == excludeSpans.doc())) {
                            if (!(excludeSpans.end() <= includeSpans.start())) {
                                count += 1;

                                if (count > proximity) {
                                    break;
                                }
                            }

                            moreExclude = excludeSpans.next(); // increment exclude
                        }

                        if (!moreExclude // if no intersection
                                || (includeSpans.doc() != excludeSpans.doc())
                                || (includeSpans.end() <= excludeSpans.start())) {
                            break; // we found a match
                        }

                        moreInclude = includeSpans.next(); // intersected: keep scanning
                    }

                    return moreInclude;
                }

                public boolean skipTo(int target) throws IOException {
                    if (moreInclude) { // skip include
                        moreInclude = includeSpans.skipTo(target);
                    }

                    if (!moreInclude) {
                        return false;
                    }

                    if (moreExclude // skip exclude
                            && (includeSpans.doc() > excludeSpans.doc())) {
                        moreExclude = excludeSpans.skipTo(includeSpans.doc());
                    }

                    int count = 0;

                    while (moreExclude // while exclude is before
                            && (includeSpans.doc() == excludeSpans.doc())) {
                        if (!(excludeSpans.end() <= includeSpans.start())) {
                            count += 1;

                            if (count > proximity) {
                                break;
                            }
                        }

                        moreExclude = excludeSpans.next(); // increment exclude
                    }

                    if (!moreExclude // if no intersection
                            || (includeSpans.doc() != excludeSpans.doc())
                            || (includeSpans.end() <= excludeSpans.start())) {
                        return true; // we found a match
                    }

                    return next(); // scan to next match
                }

                public int doc() {
                    return includeSpans.doc();
                }

                public int start() {
                    return includeSpans.start();
                }

                public int end() {
                    return includeSpans.end();
                }

                public String toString() {
                    return "spans(" + SpanWithinQuery.this.toString() + ")";
                }
            };
    }

    public Query rewrite(IndexReader reader) throws IOException {
        SpanWithinQuery clone = null;

        SpanQuery rewrittenInclude = (SpanQuery) include.rewrite(reader);

        if (rewrittenInclude != include) {
            clone = (SpanWithinQuery) this.clone();
            clone.include = rewrittenInclude;
        }

        SpanQuery rewrittenExclude = (SpanQuery) exclude.rewrite(reader);

        if (rewrittenExclude != exclude) {
            if (clone == null) {
                clone = (SpanWithinQuery) this.clone();
            }

            clone.exclude = rewrittenExclude;
        }

        if (clone != null) {
            return clone; // some clauses rewrote
        } else {
            return this; // no clauses rewrote
        }
    }

    /** Returns true iff <code>o</code> is equal to this. */
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }

        if (!(o instanceof SpanWithinQuery)) {
            return false;
        }

        SpanWithinQuery other = (SpanWithinQuery) o;

        return this.include.equals(other.include)
                && this.exclude.equals(other.exclude)
                && (this.getBoost() == other.getBoost())
                && (proximity == other.proximity);
    }

    public int hashCode() {
        int h = include.hashCode();
        h = (h << 1) | (h >>> 31); // rotate left
        h ^= exclude.hashCode();
        h = (h << 1) | (h >>> 31); // rotate left
        h ^= Float.floatToRawIntBits(getBoost());
        h ^= proximity;

        return h;
    }
}

