Posted to java-user@lucene.apache.org by Paul Smith <ps...@aconex.com> on 2006/01/05 23:09:14 UTC

"Starts with" query?

I'm throwing myself at the mercy of the Lucene community; I'm a bit
brain dead today after looking after a screaming 3 month old baby for
4 hours last night...

We have a 'title' field indexed as Field.Text(...), which works  
nicely, and has lots of good searching.

However, this application is being ported from a DB based search,
which originally had a "starts with" type search, and we need to
support that.

So, given "The quick brown fox jumped over the caffeine addicted
software developer", if the user types "The quick*" we need to find
only those documents whose title starts with that text, and NOT match
documents that have "The quick" as a sequence of terms later in the
title (i.e. don't match "blah blah the quick blah blah").

Think SQL of " ....where title like 'The quick%'  ".

How do I do that with Lucene?  I'm sure this is a dumb question,
and I know that Lucene's searching is way more useful than that, but
you know these pesky compatibility requirements... It's screwing
with my unit tests because the new index search is returning
more results than the old DB method.

cheers,

Paul Smith

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: "Starts with" query?

Posted by Yonik Seeley <ys...@gmail.com>.
That's deprecated now of course... so you want MultiPhraseQuery.

-Yonik

On 1/5/06, Yonik Seeley <ys...@gmail.com> wrote:
> Check out PhrasePrefixQuery.
>
> -Yonik



Re: "Starts with" query?

Posted by Yonik Seeley <ys...@gmail.com>.
Check out PhrasePrefixQuery.

-Yonik





Re: "Starts with" query?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 6, 2006, at 7:00 AM, Erik Hatcher wrote:
> Note that the query is "aut.*", not "aut*", so that it is a
> valid regular expression for what you want.  In my current project,
> my custom query parser handles * and ? like WildcardQuery, but
> under the covers I simply convert that into a regex by replacing ?
> with . and * with .*

Let me add a major caveat, especially given that Paul's index is
large.  By default, (Span)RegexQuery currently scans through *every*
term in the index, because it is hard to determine the static prefix
of a regex in general.  While it is obvious that "aut.*" should only
scan through terms starting with "aut", it gets more complicated with
expressions like "a?uto" because the "a" is optional.  There is also a
Jakarta Regexp implementation in contrib/regex, and it is capable
of determining the static prefix to reduce term enumeration, but I
suspect java.util.regex is much faster than Jakarta Regexp.  In my
project I'm using a blend of the two, letting Jakarta Regexp
determine the prefix but using java.util.regex for matching; this
requires a custom, and trivial, implementation of RegexCapabilities.
I didn't include that in contrib/regex because it seems a bit awkward
for general consumption.
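
To make the cost difference concrete, here is a plain-Java model of
prefix-limited term enumeration.  It does not use Lucene or either
RegexCapabilities implementation: the class PrefixLimitedScan, the
deliberately conservative literalPrefix helper, and the sorted set
standing in for the term dictionary are all assumptions made for
illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.regex.Pattern;

// Hypothetical model, not Lucene code: a sorted set plays the term dictionary.
class PrefixLimitedScan {
    private static final String META = "\\.[]{}()*+?^$|";

    // Leading literal characters that every match must start with.
    // Conservative: gives up (empty prefix) on any alternation, and drops a
    // literal that is followed by a quantifier which may match zero times.
    static String literalPrefix(String regex) {
        if (regex.indexOf('|') >= 0) {
            return "";
        }
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (META.indexOf(c) >= 0) {
                if ((c == '?' || c == '*' || c == '{') && prefix.length() > 0) {
                    prefix.setLength(prefix.length() - 1);
                }
                break;
            }
            prefix.append(c);
        }
        return prefix.toString();
    }

    // Terms that have to be tested: only the contiguous range sharing the
    // prefix. Assumes the set uses natural String ordering.
    static List<String> candidates(NavigableSet<String> sortedTerms, String regex) {
        String prefix = literalPrefix(regex);
        List<String> result = new ArrayList<>();
        for (String term : sortedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // walked past the prefix range
            }
            result.add(term);
        }
        return result;
    }

    // Candidates that match the whole regex.
    static List<String> matchingTerms(NavigableSet<String> sortedTerms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String term : candidates(sortedTerms, regex)) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

With "aut.*" the scan only touches terms beginning with "aut"; with
"a?uto" the prefix is empty, so every term is a candidate, which is
the full-enumeration case described above.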

Anyway, caveat emptor for term enumeration with (Span)RegexQuery!
Also, doing term rotation at index time and at search time can
greatly reduce term enumeration even with leading wildcards - but
I'll leave that as an exercise for the reader for now :)
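
One way to read "term rotation" is a permuterm-style index: store
every rotation of each term plus an end marker, then rotate the
wildcard pattern so the "*" lands at the end and the lookup becomes a
prefix scan.  That reading is an assumption, as is everything in the
sketch below (the class RotatedTermIndex, the '$' end marker, terms
never containing '$', and exactly one '*' per pattern).  It is plain
Java, not Lucene.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of a rotated-term (permuterm-style) lookup table.
class RotatedTermIndex {
    private static final char END = '$';

    // rotation -> original term, kept sorted so prefixes form a contiguous range
    private final NavigableMap<String, String> rotations = new TreeMap<>();

    // Index time: store every rotation of term + END.
    void add(String term) {
        String marked = term + END;
        for (int i = 0; i < marked.length(); i++) {
            rotations.put(marked.substring(i) + marked.substring(0, i), term);
        }
    }

    // Search time: "X*Y" becomes the prefix "Y$X", so even a leading
    // wildcard turns into a prefix scan over the rotations.
    Set<String> match(String wildcard) {
        int star = wildcard.indexOf('*');
        if (star < 0 || star != wildcard.lastIndexOf('*')) {
            throw new IllegalArgumentException("expected exactly one '*': " + wildcard);
        }
        String key = wildcard.substring(star + 1) + END + wildcard.substring(0, star);
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, String> entry : rotations.tailMap(key, true).entrySet()) {
            if (!entry.getKey().startsWith(key)) {
                break; // walked past the prefix range
            }
            hits.add(entry.getValue());
        }
        return hits;
    }
}
```

The price is index size: a term of n characters is stored n + 1 times.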

	Erik




Re: "Starts with" query?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 5, 2006, at 7:01 PM, Paul Smith wrote:
> First off, in response to my own post: I meant PhraseQuery instead.
>
> But since we're only tokenizing this field, and not storing the
> entire contents of the field, I'm not sure this is ever going to
> work, is it?

Sure it will :)

> I notice that if I have a title "auto update", then the phrase  
> query trick works if it searches on
>
> 	title:"0start0 auto*"
>
> but does not find any matches for
>
> 	title:"0start0 aut*"
>
> I'm a bit stuck.

PhraseQuery does not handle wildcards.  Unfortunately this is a common
misunderstanding.

The MultiPhraseQuery could do this provided you expand "aut*" into  
all the matching terms yourself.  But here is an alternative using  
the new SpanRegexQuery (in contrib/regex):

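     // Needs lucene-core plus the contrib/regex jar; the classes come from
     // org.apache.lucene.{analysis,document,index,search,search.spans,
     // search.regex,store}.  assertEquals is JUnit's, so this is
     // presumably the body of a test method.

     // Index two titles; only the first one starts with "auto".
     RAMDirectory directory = new RAMDirectory();
     IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
     Document doc = new Document();
     doc.add(new Field("field", "auto update", Field.Store.NO, Field.Index.TOKENIZED));
     writer.addDocument(doc);
     doc = new Document();
     doc.add(new Field("field", "first auto update", Field.Store.NO, Field.Index.TOKENIZED));
     writer.addDocument(doc);
     writer.optimize();
     writer.close();

     // Match terms fitting the regex "aut.*", but only where the span ends
     // at or before position 1, i.e. on the first token of the field.
     IndexSearcher searcher = new IndexSearcher(directory);
     SpanRegexQuery srq = new SpanRegexQuery(new Term("field", "aut.*"));
     SpanFirstQuery sfq = new SpanFirstQuery(srq, 1);
     Hits hits = searcher.search(sfq);
     assertEquals(1, hits.length());  // only "auto update" is a hit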

Note that the query is "aut.*", not "aut*", so that it is a valid
regular expression for what you want.  In my current project, my
custom query parser handles * and ? like WildcardQuery, but under the
covers I simply convert that into a regex by replacing ? with . and *
with .*
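
A minimal sketch of that wildcard-to-regex conversion in plain Java,
for illustration only.  It is not the custom query parser mentioned
above: the class name WildcardToRegex is made up, and escaping the
other regex metacharacters is an added assumption so that a literal
"." in the user's input stays literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper: turns WildcardQuery-style syntax into a regex string.
class WildcardToRegex {
    // Regex metacharacters other than the two wildcards handled below.
    private static final String META = "\\.[]{}()+^$|";

    static String toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");   // any run of characters
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else if (META.indexOf(c) >= 0) {
                regex.append('\\').append(c);  // keep it literal (assumption)
            } else {
                regex.append(c);
            }
        }
        return regex.toString();
    }

    // Convenience check against a single term, matching the whole term.
    static boolean matches(String wildcard, String term) {
        return Pattern.matches(toRegex(wildcard), term);
    }
}
```

toRegex("aut*") gives "aut.*", the form used in the SpanRegexQuery
example.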

	Erik




Re: "Starts with" query?

Posted by Paul Smith <ps...@aconex.com>.
First off, in response to my own post: I meant PhraseQuery instead.

But since we're only tokenizing this field, and not storing the
entire contents of the field, I'm not sure this is ever going to
work, is it?

I notice that if I have a title "auto update", then the phrase query  
trick works if it searches on

	title:"0start0 auto*"

but does not find any matches for

	title:"0start0 aut*"

I'm a bit stuck.

Paul

On 06/01/2006, at 10:43 AM, Paul Smith wrote:

>> 2) index a magic token at the start of the title and include that  
>> in a
>> phrase query:
>>    "_START_ the quick"
>
> OK, I've gone and chosen "0start0" as my start token, because our
> analyzer is stripping _.
>
> Now, second dumb question of the day: given the search for starts
> with "The qui*", that has to be turned into a prefix query, like so?
>
> new PrefixQuery(new Term("title", "0start0 " +  "The qui"))
>
> Is that the right approach?  To always prefix the search term  
> string with the magic start text?
>
> I ask this because I'm getting weird results in my search now, as  
> all documents are being matched.  When the query is finally run, it  
> looks like this:
>
> +(orgid:7 publicflag:1 sharedorgid:7) +isregistered:1  
> +title:'0start0 f'* +cversion:1
>
> (you can ignore all but the title field in this case, the rest is  
> correct for our app).
>
> Paul




Re: "Starts with" query?

Posted by Paul Smith <ps...@aconex.com>.
> 2) index a magic token at the start of the title and include that in a
> phrase query:
>    "_START_ the quick"

OK, I've gone and chosen "0start0" as my start token, because our
analyzer is stripping _.

Now, second dumb question of the day: given the search for starts with
"The qui*", that has to be turned into a prefix query, like so?

new PrefixQuery(new Term("title", "0start0 " +  "The qui"))

Is that the right approach?  To always prefix the search term string  
with the magic start text?

I ask this because I'm getting weird results in my search now, as all  
documents are being matched.  When the query is finally run, it looks  
like this:

+(orgid:7 publicflag:1 sharedorgid:7) +isregistered:1
+title:'0start0 f'* +cversion:1

(you can ignore all but the title field in this case, the rest is  
correct for our app).

Paul



Re: "Starts with" query?

Posted by Paul Smith <ps...@aconex.com>.
> 1) also index the field untokenized and use a straight prefix query
See my reply to Chris, not sure I can afford the index size increment.

> 2) index a magic token at the start of the title and include that in a
> phrase query:
>    "_START_ the quick"

ooooh, that's clever.

> 3) use a SpanFirst query (but you have to make the Java Query  
> object yourself)

Will SpanFirst find the phrase at the start even where the last term
might not be complete, i.e. the search would be "The qu*"?

The _START_ trick might do it, and it's efficient in terms of index size etc.

Paul



Re: "Starts with" query?

Posted by Yonik Seeley <ys...@gmail.com>.
Off the top of my head:
1) also index the field untokenized and use a straight prefix query
2) index a magic token at the start of the title and include that in a
phrase query:
   "_START_ the quick"
3) use a SpanFirst query (but you have to make the Java Query object yourself)

-Yonik




Re: "Starts with" query?

Posted by Paul Smith <ps...@aconex.com>.
>
> One thing you may not have thought about yet that may affect your
> decision: sorting in Lucene requires the field be indexed but
> untokenized.  So if you want to support sorting on the conceptual
> "title", you'll still need a version of your title field that's
> untokenized, which can then be used for these types of queries for
> free.
>
> (It's the kind of thing people sometimes don't realize until late in
> their development; they make sure all of their queries return the
> results they expect before they worry about what kinds of sorting
> they need to support.)

Thanks for that note, we had decided that this was a field we were  
just not going to be able to support sorting on, for both index size  
and search/sort performance.

Paul



Re: "Starts with" query?

Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks Chris, I had thought of that one, but unfortunately the title
: could be quite long, and there are literally millions of documents.
: Isn't each title going to be included as one "term" in the index
: dictionary?  If so, won't the index get ridiculously large and slow?

It depends on your definitions of "large" and "slow" ... I haven't had any
complaints, but I haven't really tried changing my implementation to see if
I can improve on it.

One thing you may not have thought about yet that may affect your
decision: sorting in Lucene requires the field be indexed but untokenized.
So if you want to support sorting on the conceptual "title", you'll still
need a version of your title field that's untokenized, which can then be
used for these types of queries for free.

(It's the kind of thing people sometimes don't realize until late in
their development; they make sure all of their queries return the results
they expect before they worry about what kinds of sorting they need to
support.)



-Hoss




Re: "Starts with" query?

Posted by Paul Smith <ps...@aconex.com>.
On 06/01/2006, at 9:33 AM, Chris Hostetter wrote:

>
>
> : Think SQL of " ....where title like 'The quick%'  ".
>
> I solved this problem by having a variation of my field that was not
> tokenized, and did PrefixQueries on that field (so in your case, leave
> your title field alone for generic matches, and have a
> titleUntokenized field for prefix matches).

Thanks Chris, I had thought of that one, but unfortunately the title  
could be quite long, and there are literally millions of documents.   
Isn't each title going to be included as one "term" in the index  
dictionary?  If so, won't the index get ridiculously large and slow?
>
> Another approach I have not tried, which should work as long as your
> "starts with" input is always whole terms (i.e. "The quick %" and not
> "The qu%"), is to use a SpanFirstQuery wrapped around a SpanNearQuery
> containing successive SpanTermQueries with no slop.

ooh err, Span queries, I'd better go have a look at them.

thanks,

Paul



Re: "Starts with" query?

Posted by Chris Hostetter <ho...@fucit.org>.

: Think SQL of " ....where title like 'The quick%'  ".

I solved this problem by having a variation of my field that was not
tokenized, and did PrefixQueries on that field (so in your case, leave
your title field alone for generic matches, and have a titleUntokenized
field for prefix matches).

Another approach I have not tried, which should work as long as your
"starts with" input is always whole terms (i.e. "The quick %" and not
"The qu%"), is to use a SpanFirstQuery wrapped around a SpanNearQuery
containing successive SpanTermQueries with no slop.
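
The whole-terms restriction is easier to see in a small plain-Java
model of the matching rules.  This is not Lucene code: the class
StartsWithModel and its crude tokenize method (lower-case, then split
on anything that is not an ASCII letter or digit) are assumptions
standing in for a real analyzer, and the two methods only mimic what a
span query anchored at position 0 would accept with exact terms, or
with a partial last term expanded to its matches.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical model of "starts with" over a tokenized title field.
class StartsWithModel {
    // Crude stand-in for an analyzer (assumption, ASCII only).
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // Exact terms only: the title's leading tokens must equal the query
    // tokens, in order, starting at position 0.
    static boolean startsWithWholeTerms(String title, String query) {
        List<String> titleTokens = tokenize(title);
        List<String> queryTokens = tokenize(query);
        return queryTokens.size() <= titleTokens.size()
                && titleTokens.subList(0, queryTokens.size()).equals(queryTokens);
    }

    // Same rule, except the last query token only has to be a prefix of
    // the title token at that position.
    static boolean startsWithPartialLastTerm(String title, String query) {
        List<String> titleTokens = tokenize(title);
        List<String> queryTokens = tokenize(query);
        if (queryTokens.isEmpty() || queryTokens.size() > titleTokens.size()) {
            return false;
        }
        int last = queryTokens.size() - 1;
        return titleTokens.subList(0, last).equals(queryTokens.subList(0, last))
                && titleTokens.get(last).startsWith(queryTokens.get(last));
    }
}
```

In this model "The quick" passes the whole-terms check against "The
quick brown fox" while "The qu" does not; only the second method
accepts it, and neither accepts "blah blah the quick blah blah".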


-Hoss

