Posted to java-user@lucene.apache.org by Robert Watkins <rw...@foo-bar.org> on 2005/10/11 15:32:26 UTC

wildcards within a phrase query

I've been trying to figure out the best way to support queries such as:

   "going to he* in a hand-basket"

such that it's almost a PhraseQuery, except that the third term (in this
case) is a PrefixQuery.

The only idea that comes to mind is to combine a PhraseQuery and
a PrefixQuery (or, in other situations, a WildcardQuery) using relative
positional parameters. That is, one would combine a PhraseQuery
for:

   "going to [gap] in a hand-basket"

with a PrefixQuery for:

   "he"

such that the position of the Term used in the PrefixQuery fits into the
slot left by the [gap] of the PhraseQuery.
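
A minimal sketch of that slot idea, outside Lucene: it treats a document as a plain list of tokens and a phrase as a list of slots, where a slot ending in * is matched as a prefix and every other slot must match exactly. The class and method names (GapPhraseMatcher, slotMatches, indexOfPhrase) are hypothetical and only illustrate the positional constraint; this is not how PhraseQuery is implemented.

```java
import java.util.List;

final class GapPhraseMatcher {

    /** A slot ending in '*' matches any token with that prefix; otherwise it must match exactly. */
    static boolean slotMatches(String slot, String token) {
        if (slot.endsWith("*")) {
            return token.startsWith(slot.substring(0, slot.length() - 1));
        }
        return slot.equals(token);
    }

    /** Returns the first token position where every slot matches consecutive tokens, or -1. */
    static int indexOfPhrase(List<String> tokens, List<String> slots) {
        if (slots.isEmpty()) {
            return -1;
        }
        for (int start = 0; start + slots.size() <= tokens.size(); start++) {
            boolean allMatch = true;
            for (int i = 0; i < slots.size() && allMatch; i++) {
                allMatch = slotMatches(slots.get(i), tokens.get(start + i));
            }
            if (allMatch) {
                return start;
            }
        }
        return -1;
    }
}
```

With the slots "going", "to", "he*", "in", "a", "hand-basket", the third slot is the only one matched by prefix; all others keep their exact relative positions.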

Does that sound like I'm on the right track?

-- Robert

--------------------
Robert Watkins
rwatkins@foo-bar.org
--------------------

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: wildcards within a phrase query

Posted by Robert Watkins <rw...@foo-bar.org>.
Thank you, Daniel. Once I have a suitable environment I will
certainly give this a whirl.

-- Robert

On Wed, 12 Oct 2005, Daniel Naber wrote:

> On Wednesday 12 October 2005 17:18, Robert Watkins wrote:
>
>> Does that sound reasonable -- and scalable -- to you?
>
> I don't think you need to iterate at all; you can easily expand the terms
> of a query:
>
>    QueryParser qp = new QueryParser("f", new StandardAnalyzer());
>    Query q = qp.parse("e*");
>    IndexReader ir = IndexReader.open("/tmp/index");
>    System.out.println(q.rewrite(ir));
>
> This will expand e* to all of your index's words that start with "e". Same
> for wildcard queries.
>
> I'll leave the guessing about performance to others. Why not just give it a
> try when it's so easy?
>
> Regards
> Daniel
>
> -- 
> http://www.danielnaber.de



Re: wildcards within a phrase query

Posted by Daniel Naber <lu...@danielnaber.de>.
On Wednesday 12 October 2005 17:18, Robert Watkins wrote:

> Does that sound reasonable -- and scalable -- to you?

I don't think you need to iterate at all; you can easily expand the terms
of a query:

    QueryParser qp = new QueryParser("f", new StandardAnalyzer());
    Query q = qp.parse("e*");
    IndexReader ir = IndexReader.open("/tmp/index");
    System.out.println(q.rewrite(ir));

This will expand e* to all of your index's words that start with "e". Same
for wildcard queries.

I'll leave the guessing about performance to others. Why not just give it a 
try when it's so easy?

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: wildcards within a phrase query

Posted by Robert Watkins <rw...@foo-bar.org>.
Having now looked at the test cases in SVN (specifically,
TestMultiPhraseQuery.java), I cannot see any tests using simple
wildcards, only terms ending with *, and thus suitable for a
PrefixQuery. The examples do reveal how it could be done for wildcards,
but my concern turns to scalability.

I am only in the beginning stages of creating a prototype, so I don't
have a suitable test environment for this sort of thing yet, but when
it comes to fruition there will be 11 indexes, with anywhere from
5,000 to 600,000 documents in each index. Based on a test index with 90
short documents in it, I can easily expect upwards of 1,000 terms per
document. While there would likely be more repetition of terms as the
number of documents in an index increases, that's still a lot of terms.

Also, any given query will need to be able to search across any number
of those indexes. As such, to create a query for, say, "bl?nd ambition",
it looks as if one would have to do something like:

   // search for "bl?nd ambition":
   MultiPhraseQuery query = new MultiPhraseQuery();
   String prefix, regex;
   LinkedList termsWithPrefix = new LinkedList();
   TermEnum te;
   while (queryTermEnum.hasNext()) {
       String queryTerm = (String)queryTermEnum.next();
       if (hasWildcard(queryTerm)) {
           // get "bl" from "bl?nd"
           prefix = getWildcardPrefix(queryTerm);
           // get "bl.nd" from "bl?nd"
           regex  = getWildcardRegex(queryTerm);
           termsWithPrefix.clear();
           for (int i = 0; i < openIndexReaders.length; i++) {
               te = openIndexReaders[i].terms(new Term("body", prefix));
               do {
                   if (te.term().text().matches(regex)) {
                       termsWithPrefix.add(te.term());
                   }
               } while (te.next());
           }
           query.add((Term[])termsWithPrefix.toArray(new Term[0]));
       }
       else {
           query.add(new Term("body", queryTerm));
       }
   }
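
The loop above calls hasWildcard, getWildcardPrefix and getWildcardRegex without showing them. One possible shape for those helpers is sketched below; the bodies are assumptions, not code from this thread, and they assume only ? (one character) and * (any run of characters) are wildcards. Literal text is passed through Pattern.quote, so the regex for "bl?nd" is not literally "bl.nd", but it should match the same terms.

```java
import java.util.regex.Pattern;

final class WildcardTerms {

    /** True if the term contains '?' or '*'. */
    static boolean hasWildcard(String term) {
        return term.indexOf('?') >= 0 || term.indexOf('*') >= 0;
    }

    /** Literal text before the first wildcard, e.g. "bl" for "bl?nd". */
    static String getWildcardPrefix(String term) {
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '?' || c == '*') {
                return term.substring(0, i);
            }
        }
        return term;
    }

    /** Regex for String.matches: '?' becomes '.', '*' becomes '.*', all other text is quoted. */
    static String getWildcardRegex(String term) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '?' || c == '*') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '?' ? "." : ".*");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return regex.toString();
    }
}
```

Quoting the literal parts matters for terms that contain regex metacharacters, such as a dot in "a.b", which would otherwise match any character.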

Does that sound reasonable -- and scalable -- to you?
-- Robert

PS -- Would it be possible to avoid going through _all_ the terms in the
TermEnum (that are greater than prefix, of course) by doing something
like:

   } while (te.next() && te.term().text().startsWith(prefix));

or would analysis possibly make that unwise?
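
A sketch of the early-exit idea in this PS, using a java.util.TreeSet as a stand-in for the sorted term dictionary. This is an assumption for illustration and not the Lucene TermEnum API; the names PrefixScan and scan are hypothetical. Because the terms are visited in sorted order, the first term that no longer starts with the prefix ends the scan. The sketch also assumes the prefix has been normalized the same way as the indexed terms (for example lowercased), since the comparison is made against terms as they were stored after analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

final class PrefixScan {

    /**
     * Collects the terms that start with prefix and match regex.
     * Iteration starts at the first term >= prefix and stops at the
     * first term that does not start with prefix.
     */
    static List<String> scan(SortedSet<String> sortedTerms, String prefix, String regex) {
        List<String> hits = new ArrayList<>();
        for (String term : sortedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            if (term.matches(regex)) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

With a real TermEnum the same loop would presumably also need to stop when the enumeration runs out of terms or moves on to a different field.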


On Wed, 12 Oct 2005, Daniel Naber wrote:

> On Wednesday 12 October 2005 00:15, Robert Watkins wrote:
>
>> Wonderful! But what about wildcards? I realised after I had sent the
>> last message that my pattern should have been written:
>>
>>   ( term | term as prefix | wildcard term )+
>
> Have a look at the test cases: you need to expand the terms yourself, i.e.
> it doesn't matter if there's a prefix or wildcard term. There's no support
> for *direct* input of something like (a phrase query) "foo* bar".
>
> Regards
> Daniel



Re: wildcards within a phrase query

Posted by Daniel Naber <lu...@danielnaber.de>.
On Wednesday 12 October 2005 00:15, Robert Watkins wrote:

> Wonderful! But what about wildcards? I realised after I had sent the
> last message that my pattern should have been written:

Have a look at the test cases: you need to expand the terms yourself, i.e. 
it doesn't matter if there's a prefix or wildcard term. There's no support 
for *direct* input of something like (a phrase query) "foo* bar".
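
To illustrate what "expand the terms yourself" might look like, here is a sketch that turns each phrase slot into a list of alternatives, one list per position, which is roughly the shape of input a multi-term phrase query takes. It uses a sorted set of strings as a stand-in for the index's term dictionary; SlotExpander, expandSlot and expandPhrase are hypothetical names, only a trailing * is handled, and none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

final class SlotExpander {

    /** A slot ending in '*' expands to every dictionary term with that prefix; others stay as-is. */
    static List<String> expandSlot(String slot, SortedSet<String> dictionary) {
        if (!slot.endsWith("*")) {
            List<String> single = new ArrayList<>();
            single.add(slot);
            return single;
        }
        String prefix = slot.substring(0, slot.length() - 1);
        // Range view over [prefix, prefix + U+FFFF): in practice, all terms starting with prefix.
        return new ArrayList<>(dictionary.subSet(prefix, prefix + Character.MAX_VALUE));
    }

    /** One list of alternative terms per phrase position. */
    static List<List<String>> expandPhrase(List<String> slots, SortedSet<String> dictionary) {
        List<List<String>> positions = new ArrayList<>();
        for (String slot : slots) {
            positions.add(expandSlot(slot, dictionary));
        }
        return positions;
    }
}
```

A prefix that matches nothing yields an empty list for that position, which a caller would probably want to treat as "the phrase cannot match" rather than pass along.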

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: wildcards within a phrase query

Posted by Robert Watkins <rw...@foo-bar.org>.
Wonderful! But what about wildcards? I realised after I had sent the
last message that my pattern should have been written:

   ( term | term as prefix | wildcard term )+

-- Robert

On Tue, 11 Oct 2005, Daniel Naber wrote:

> On Tuesday 11 October 2005 22:53, Robert Watkins wrote:
>
>> I was under the impression that PhrasePrefixQuery only worked in the
>> special case of the term that would otherwise be used in a PrefixQuery
>> coming at the end of the sequence of terms, as in:
>
> No, the test cases show that the prefix term can be anywhere (at least the
> test cases in SVN do, where the class has been renamed MultiPhraseQuery).
>
> Regards
> Daniel
>
> -- 
> http://www.danielnaber.de



Re: wildcards within a phrase query

Posted by Daniel Naber <lu...@danielnaber.de>.
On Tuesday 11 October 2005 22:53, Robert Watkins wrote:

> I was under the impression that PhrasePrefixQuery only worked in the
> special case of the term that would otherwise be used in a PrefixQuery
> coming at the end of the sequence of terms, as in:

No, the test cases show that the prefix term can be anywhere (at least the 
test cases in SVN do, where the class has been renamed MultiPhraseQuery).

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: wildcards within a phrase query

Posted by Robert Watkins <rw...@foo-bar.org>.
I was under the impression that PhrasePrefixQuery only worked in the
special case of the term that would otherwise be used in a PrefixQuery
coming at the end of the sequence of terms, as in:

   ( term )+ ( term as prefix )

but not where either a WildcardQuery or a PrefixQuery occurs anywhere
in the sequence of terms, such as:

   ( term | term as prefix )+

Am I missing something?

-- Robert

On Tue, 11 Oct 2005, Daniel Naber wrote:

> On Tuesday 11 October 2005 15:32, Robert Watkins wrote:
>
>> The only idea that comes to mind is to try to combine a PhraseQuery and
>> a PrefixQuery
>
> Yes, PhrasePrefixQuery already supports that.
>
> Regards
> Daniel
>
> -- 
> http://www.danielnaber.de



Re: wildcards within a phrase query

Posted by Daniel Naber <lu...@danielnaber.de>.
On Tuesday 11 October 2005 15:32, Robert Watkins wrote:

> The only idea that comes to mind is to try to combine a PhraseQuery and
> a PrefixQuery

Yes, PhrasePrefixQuery already supports that.

Regards
 Daniel

-- 
http://www.danielnaber.de
