Posted to java-user@lucene.apache.org by Cedric Ho <ce...@gmail.com> on 2007/07/18 05:58:44 UTC

WildcardQuery and SpanQuery

Hi everybody,

We recently needed to support the wildcard search terms "*" and "?"
together with SpanQuery. It seems that there is no SpanWildcardQuery
available. After looking into the Lucene source code for a while, I
guess we can either:

1. Use SpanRegexQuery, or

2. Write our own SpanWildcardQuery and implement the
rewrite(IndexReader) method to rewrite the query into a SpanOrQuery
of SpanTermQuery clauses.

Of the two approaches, option 1 seems to be easier, but I am rather
concerned about the performance of using regular expressions. On the
other hand, I am not sure whether there are other concerns I am not
aware of for option 2 (i.e. is there a reason why there's no
SpanWildcardQuery in the first place?).

Any advice?

Cedric

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
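
For option 1, the wildcard pattern would first have to be translated into a
regular expression. A minimal sketch of that translation, assuming only "*"
(any run of characters) and "?" (exactly one character) are special. The names
WildcardToRegex and toRegex are made up for illustration and are not part of
Lucene:

```java
import java.util.regex.Pattern;

public class WildcardToRegex {
    /** Converts a wildcard pattern into an equivalent java.util.regex string. */
    public static String toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*' || c == '?') {
                // Flush pending literal text, quoted so that any regex
                // metacharacters in it are matched literally.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String regex = toRegex("te?t*");
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "testing"));
        System.out.println(Pattern.matches(regex, "tet"));
    }
}
```

If this route is taken, the resulting string would be used as the text of the
Term handed to SpanRegexQuery. That wiring is left out here because it needs
the contrib/regex jar on the classpath.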


Re: WildcardQuery and SpanQuery

Posted by Mark Miller <ma...@gmail.com>.
You could give this a shot (From my Qsol query parser):

package com.mhs.qsol.spans;

/**
 * Copyright 2006 Mark Miller (markrmiller@gmail.com)
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

/**
 * @author mark miller
 *
 */
public class SpanWildcardQuery extends SpanQuery {
    private Term term;

    private BooleanQuery rewrittenWildQuery;

    public SpanWildcardQuery(Term term) {
        this.term = term;
    }

    public Term getTerm() {
        return term;
    }

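    public Query rewrite(IndexReader reader) throws IOException {
        // Let WildcardQuery enumerate the index terms that match the pattern.
        WildcardQuery wildQuery = new WildcardQuery(term);
        Query rewritten = wildQuery.rewrite(reader);

        SpanQuery[] sqs;

        if (rewritten instanceof TermQuery) {
            // Depending on the Lucene version, the rewrite may come back as a
            // bare TermQuery (e.g. when the pattern contains no wildcard), so
            // do not cast blindly to BooleanQuery.
            TermQuery tq = (TermQuery) rewritten;
            sqs = new SpanQuery[] { new SpanTermQuery(tq.getTerm()) };
            sqs[0].setBoost(tq.getBoost());
        } else {
            // Otherwise expect a BooleanQuery holding one TermQuery per match.
            // A second rewrite is deliberately avoided: BooleanQuery.rewrite
            // collapses a single-clause query into that clause, which would
            // no longer be a BooleanQuery.
            rewrittenWildQuery = (BooleanQuery) rewritten;

            BooleanClause[] clauses = rewrittenWildQuery.getClauses();
            sqs = new SpanQuery[clauses.length];

            for (int i = 0; i < clauses.length; i++) {
                TermQuery tq = (TermQuery) clauses[i].getQuery();

                sqs[i] = new SpanTermQuery(tq.getTerm());
                sqs[i].setBoost(tq.getBoost());
            }
        }

        SpanOrQuery query = new SpanOrQuery(sqs);
        // Carry over the boost set on this query; wildQuery was created above
        // and only ever has the default boost.
        query.setBoost(getBoost());

        return query;
    }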

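    public Spans getSpans(IndexReader reader) throws IOException {
        // Searching normally rewrites the query first, so spans are only ever
        // requested from the SpanOrQuery returned by rewrite().
        throw new UnsupportedOperationException(
                "Query should have been rewritten");
    }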

    public String getField() {
        return term.field();
    }

    /**
     * @deprecated use extractTerms instead
     * @see #extractTerms(Set)
     */
    public Collection getTerms() {
        Collection terms = new ArrayList();
        terms.add(term);

        return terms;
    }

    public void extractTerms(Set terms) {
        terms.add(term);
    }

    public String toString(String field) {
        StringBuffer buffer = new StringBuffer();
        buffer.append("spanWildcardQuery(");
        buffer.append(term);
        buffer.append(")");

        // buffer.append(ToStringUtils.boost(getBoost()));
        return buffer.toString();
    }
}
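
The rewrite above boils down to two steps: enumerate the index terms that match
the wildcard, then OR them together as span term queries. The sketch below
imitates only the first step, without Lucene, by matching a pattern against an
in-memory sorted term list that stands in for the index's term dictionary.
WildcardExpansion, matches and expand are illustrative names, not Lucene or
Qsol API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class WildcardExpansion {
    /** True if text matches pattern; '*' is any run of chars, '?' is one char. */
    public static boolean matches(String pattern, String text) {
        int p = 0;
        int t = 0;
        int starP = -1; // index of the most recent '*' in the pattern
        int starT = -1; // text index that '*' is currently matched up to
        while (t < text.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                starP = p++;
                starT = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == text.charAt(t))) {
                p++;
                t++;
            } else if (starP >= 0) {
                // Backtrack: let the last '*' absorb one more character.
                p = starP + 1;
                t = ++starT;
            } else {
                return false;
            }
        }
        // Only trailing '*' characters may remain unmatched in the pattern.
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }

    /** Collects the terms that match the pattern, in term order. */
    public static List<String> expand(String pattern, SortedSet<String> terms) {
        List<String> matched = new ArrayList<String>();
        for (String term : terms) {
            if (matches(pattern, term)) {
                matched.add(term);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<String>(
                Arrays.asList("tent", "test", "tested", "text", "toast"));
        System.out.println(expand("te?t*", terms));
    }
}
```

In the real query each matched term becomes one SpanTermQuery clause of the
SpanOrQuery, which is why a broad pattern can produce a very large query.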


Re: WildcardQuery and SpanQuery

Posted by Cedric Ho <ce...@gmail.com>.
Thanks so much for helping ~ I will try it out tomorrow.

Regards,
Cedric


Re: WildcardQuery and SpanQuery

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 18 July 2007 12:30, Cedric Ho wrote:
> Thanks for the quick response Paul =)
> 
> However I am lost while looking at the surround package.

That is not really surprising: the code is factored to the bone and it
is hardly documented.
You could have a look at the test code to start.
The surround.txt file in the contrib/surround directory should also
be helpful.

> Are you 
> suggesting I can solve my problem at hand using the surround package?

In case the surround syntax fits what you need, you might use the surround
package.

You could also use your own parser and target the
o.a.l.queryParser.surround.query package.
The code posted by Mark Miller may solve your problem, too.

Regards,
Paul Elschot


Re: WildcardQuery and SpanQuery

Posted by Cedric Ho <ce...@gmail.com>.
Thanks for the quick response Paul =)

However, I am lost while looking at the surround package. Are you
suggesting I can solve my problem at hand using the surround package?


Re: WildcardQuery and SpanQuery

Posted by Paul Elschot <pa...@xs4all.nl>.
The basic problem you are facing is that in Lucene
the expansion of the terms is tightly coupled to the generation
of a combination query using the expanded terms.

In contrib/surround the term expansion and query generation
are decoupled using a visitor pattern for the terms. The code is here:
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/src/java/org/apache/lucene/queryParser/surround/query

In surround a wildcard term can provide either an OR of
normal term queries, or a SpanOrQuery of span term queries.
This query generation is in class SimpleTerm, which has one method
for a normal boolean OR query over the terms, and one for
a span query over the terms.
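
The decoupling described above might be sketched roughly as follows. This is
not the surround code: the names (VisitorSketch, MatchingTermVisitor,
visitPrefixMatches, orQuery, spanOrQuery) are hypothetical, a prefix match
stands in for the term expansion, and plain strings stand in for Lucene
queries. The point is only that one expansion routine feeds a visitor, and
each kind of query is a different visitor:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VisitorSketch {
    /** Callback invoked once for every term the expansion step finds. */
    public interface MatchingTermVisitor {
        void visitMatchingTerm(String term);
    }

    /** Expansion: finds matching terms but does not decide what is built from them. */
    public static void visitPrefixMatches(List<String> indexTerms, String prefix,
                                          MatchingTermVisitor visitor) {
        for (String term : indexTerms) {
            if (term.startsWith(prefix)) {
                visitor.visitMatchingTerm(term);
            }
        }
    }

    /** Generation, variant 1: a plain OR over the matching terms. */
    public static String orQuery(List<String> indexTerms, String prefix) {
        List<String> clauses = new ArrayList<String>();
        visitPrefixMatches(indexTerms, prefix, term -> clauses.add(term));
        return String.join(" OR ", clauses);
    }

    /** Generation, variant 2: a span-or over span term queries. */
    public static String spanOrQuery(List<String> indexTerms, String prefix) {
        List<String> clauses = new ArrayList<String>();
        visitPrefixMatches(indexTerms, prefix,
                term -> clauses.add("spanTerm(" + term + ")"));
        return "spanOr([" + String.join(", ", clauses) + "])";
    }

    public static void main(String[] args) {
        List<String> indexTerms = Arrays.asList("tent", "test", "toast");
        System.out.println(orQuery(indexTerms, "te"));
        System.out.println(spanOrQuery(indexTerms, "te"));
    }
}
```

Swapping in a different expansion (wildcard instead of prefix) would then only
touch the visiting routine, not either of the two query builders.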

In both cases surround uses a regular expression to expand
the matching terms, but that could be changed to use
a wildcard expansion mechanism other than the ones in
SrndPrefixQuery and SrndTruncQuery, which
are subclasses of SimpleTerm.

With the term expansion and the query combination split,
it is also necessary to limit the maximum number of expanded
terms in a different way than Lucene does. In surround the
classes BasicQueryFactory and TooManyBasicQueries are
used for that.
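
One way to picture such a limit is a counter that every expanded term has to
pass through before it is turned into a query. The sketch below is only
loosely modelled on that idea: TermBudget and TooManyTermsException are
hypothetical names, not the surround classes, and the limit of 3 is arbitrary:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TermBudget {
    /** Thrown when an expansion produces more terms than the budget allows. */
    public static class TooManyTermsException extends RuntimeException {
        public TooManyTermsException(String message) {
            super(message);
        }
    }

    private final int maxTerms;
    private int used = 0;

    public TermBudget(int maxTerms) {
        this.maxTerms = maxTerms;
    }

    /** Counts one expanded term, failing once the budget is exhausted. */
    public String accept(String term) {
        if (used >= maxTerms) {
            throw new TooManyTermsException(
                    "more than " + maxTerms + " expanded terms");
        }
        used++;
        return term;
    }

    /** Number of terms accepted so far. */
    public int used() {
        return used;
    }

    public static void main(String[] args) {
        TermBudget budget = new TermBudget(3);
        List<String> accepted = new ArrayList<String>();
        try {
            for (String term : Arrays.asList("tent", "test", "tested", "text")) {
                accepted.add(budget.accept(term));
            }
        } catch (TooManyTermsException e) {
            System.out.println("stopped after " + accepted + ": " + e.getMessage());
        }
    }
}
```

In this sketch, sharing a single TermBudget across all wildcard terms of a
query caps the total number of expanded terms rather than each wildcard
separately.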

Regards,
Paul Elschot


