Posted to java-user@lucene.apache.org by Paul Smith <ps...@aconex.com> on 2006/01/05 23:09:14 UTC

"Starts with" query?

I'm throwing myself at the mercy of the Lucene community. I'm a bit
brain dead today after looking after a screaming 3-month-old baby for
4 hours last night...

We have a 'title' field indexed as Field.Text(...), which works  
nicely, and has lots of good searching.

However, this application is being ported from a DB-based search,
which originally had a "starts with" type search, and we need to
support that.

So, given "The quick brown fox jumped over the caffeine addicted
software developer", if the user types "The quick*" we need to find
only those documents whose title starts with "The quick", and NOT match
documents that have "The quick" as a sequence of terms later in the
title (i.e. don't match "blah blah the quick blah blah").

Think of the SQL " ....where title like 'The quick%'  ".

How do I do that with Lucene?  I'm sure this is a dumb question,
and I know that Lucene's searching is way more useful than that, but
you know these pesky compatibility requirements... It's screwing
with my unit tests because the new index search is returning
more results than the old DB method.

cheers,

Paul Smith

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: "Starts with" query?

Posted by Yonik Seeley <ys...@gmail.com>.

PhrasePrefixQuery is deprecated now, of course... so you want MultiPhraseQuery.

-Yonik
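
MultiPhraseQuery does not expand a trailing wildcard by itself, so the terms matching the prefix have to be collected first. Below is a minimal sketch of that expansion step, assuming the field's terms are available as a sorted set (Lucene keeps its term dictionary sorted, which is what makes prefix enumeration cheap). The class name PrefixExpansion and the sample terms are made up for illustration. In real code the resulting strings would be wrapped in Term objects and handed to MultiPhraseQuery.add(Term[]) for the wildcard position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: stands in for walking Lucene's sorted term dictionary.
class PrefixExpansion {

    // Collect every term in the sorted set that starts with the given prefix.
    static List<String> expand(TreeSet<String> sortedTerms, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet() jumps to the first term >= prefix; because the set is
        // sorted, all matching terms are contiguous from that point on.
        for (String term : sortedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // first non-matching term ends the run
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(
                Arrays.asList("0start0", "auto", "autumn", "blah", "update"));
        System.out.println(expand(terms, "aut")); // prints [auto, autumn]
    }
}
```

The phrase would then be "0start0" at the first position followed by the array of expanded terms at the second.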

On 1/5/06, Yonik Seeley <ys...@gmail.com> wrote:
> Check out PhrasePrefixQuery.
>
> -Yonik


Re: "Starts with" query?

Posted by Yonik Seeley <ys...@gmail.com>.

Check out PhrasePrefixQuery.

-Yonik


On 1/5/06, Paul Smith <ps...@aconex.com> wrote:
> first off response to my own post, I meant PhraseQuery instead.
>
> But, since we're only tokenizing this field, and not storing the
> entire contents of the field, I'm not sure this is ever going to
> work, is it?
>
> I notice that if I have a title "auto update", then the phrase query
> trick works if it searches on
>
>         title:"0start0 auto*"
>
> but does not find any matches for
>
>         title:"0start0 aut*"
>
> I'm a bit stuck.
>
> Paul
>
> On 06/01/2006, at 10:43 AM, Paul Smith wrote:
>
> >> 2) index a magic token at the start of the title and include that
> >> in a
> >> phrase query:
> >>    "_START_ the quick"
> >
> > Ok, I've gone and chosen "0start0" as my start token, because our
> > analyzer is stripping _.
> >
> > Now, second dumb question of the day: given the search for starts
> > with "The qui*", does that have to be turned into a prefix query, like so?
> >
> > new PrefixQuery(new Term("title", "0start0 " +  "The qui"))
> >
> > Is that the right approach?  To always prefix the search term
> > string with the magic start text?
> >
> > I ask this because I'm getting weird results in my search now, as
> > all documents are being matched.  When the query is finally run, it
> > looks like this:
> >
> > +(orgid:7 publicflag:1 sharedorgid:7) +isregistered:1
> > +title:'0start0 f'* +cversion:1
> >
> > (you can ignore all but the title field in this case, the rest is
> > correct for our app).
> >
> > Paul

Re: "Starts with" query?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 6, 2006, at 7:00 AM, Erik Hatcher wrote:
>> I notice that if I have a title "auto update", then the phrase  
>> query trick works if it searches on
>>
>> 	title:"0start0 auto*"
>>
>> but does not find any matches for
>>
>> 	title:"0start0 aut*"
>>
>> I'm a bit stuck.
>
> PhraseQuery does not handle wildcards.  Unfortunately this is a
> common misunderstanding.
>
> The MultiPhraseQuery could do this provided you expand "aut*" into  
> all the matching terms yourself.  But here is an alternative using  
> the new SpanRegexQuery (in contrib/regex):
>
>     RAMDirectory directory = new RAMDirectory();
>     IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
>     Document doc = new Document();
>     doc.add(new Field("field", "auto update", Field.Store.NO, Field.Index.TOKENIZED));
>     writer.addDocument(doc);
>     doc = new Document();
>     doc.add(new Field("field", "first auto update", Field.Store.NO, Field.Index.TOKENIZED));
>     writer.addDocument(doc);
>     writer.optimize();
>     writer.close();
>
>     IndexSearcher searcher = new IndexSearcher(directory);
>     SpanRegexQuery srq = new SpanRegexQuery(new Term("field", "aut.*"));
>     SpanFirstQuery sfq = new SpanFirstQuery(srq, 1);
>     Hits hits = searcher.search(sfq);
>     assertEquals(1, hits.length());
>
> Notice that the query is "aut.*", not "aut*", so that it is a
> valid regular expression for what you want.  In my current project,
> my custom query parser handles * and ? like WildcardQuery, but
> under the covers I simply convert that into a regex by replacing ?
> with . and * with .*
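
The wildcard-to-regex conversion described above can be sketched in plain Java roughly as follows. This is an illustration under assumptions, not the actual query parser: the class and method names are hypothetical, escaping of other ASCII punctuation is an addition so that characters like "." in user input are matched literally, and firstTokenMatches is only a crude stand-in for SpanFirstQuery(srq, 1) that lowercases and splits on whitespace rather than running a real analyzer.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or contrib/regex.
class WildcardToRegex {

    // Translate WildcardQuery-style syntax into a java.util.regex pattern:
    // '?' becomes '.', '*' becomes ".*", and any other ASCII character that
    // is not a letter or digit is backslash-escaped so it matches literally.
    static String toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                if (c < 128 && !Character.isLetterOrDigit(c)) {
                    regex.append('\\');
                }
                regex.append(c);
            }
        }
        return regex.toString();
    }

    // Simplified "starts with" check: does the first whitespace-separated,
    // lowercased token of the title fully match the converted pattern?
    static boolean firstTokenMatches(String title, String wildcard) {
        String[] tokens = title.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return Pattern.matches(toRegex(wildcard), tokens[0]);
    }

    public static void main(String[] args) {
        System.out.println(toRegex("aut*"));                                // aut.*
        System.out.println(firstTokenMatches("auto update", "aut*"));       // true
        System.out.println(firstTokenMatches("first auto update", "aut*")); // false
    }
}
```

The two sample titles mirror the two documents indexed in the test above, where only "auto update" is expected to match.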

Let me add a major caveat, especially given that Paul's index is  
large.  (Span)RegexQuery by default, currently, scans through *every*  
term in the index.  This is due to the complexity in determining the  
prefix of the regex.  While it is obvious that "aut.*" should only  
scan through terms starting with "aut", it gets more complicated with  
expressions like "a?uto" because the "a" is optional.  There is a  
Jakarta Regexp implementation in contrib/regex also and it is capable  
of determining the static prefix to reduce term enumeration, but I  
suspect java.util.regex is much faster than Jakarta Regexp.  I'm  
using, in my project, a blending of the two letting Jakarta Regexp  
determine the prefix but using java.util.regex for matching - this  
requires a custom, and trivial, implementation of RegexCapabilities.   
I didn't include that in contrib/regex because it seems a bit awkward  
for general consumption.
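
To illustrate why the prefix is awkward to determine, here is a deliberately conservative sketch of static prefix extraction. It is not the Jakarta Regexp logic, and the class and method names are hypothetical. It keeps leading letters and digits, stops at the first other character, drops the last kept character when a quantifier could make it optional, and gives up entirely if the expression contains alternation.

```java
// Hypothetical helper: a simplified, conservative static-prefix finder.
class RegexPrefix {

    // Returns a literal string that every matching term must start with,
    // or "" when no such prefix can be guaranteed by this simple analysis.
    static String staticPrefix(String regex) {
        // Alternation anywhere: give up rather than analyse each branch.
        if (regex.indexOf('|') >= 0) {
            return "";
        }
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                prefix.append(c);
                continue;
            }
            // '?', '*' and '{' can make the preceding character optional,
            // so that character cannot be part of the guaranteed prefix.
            if ((c == '?' || c == '*' || c == '{') && prefix.length() > 0) {
                prefix.setLength(prefix.length() - 1);
            }
            break; // any other character ends the literal run
        }
        return prefix.toString();
    }

    public static void main(String[] args) {
        System.out.println(staticPrefix("aut.*"));           // aut
        System.out.println(staticPrefix("a?uto").isEmpty()); // true
    }
}
```

With a non-empty prefix, term enumeration can start at the first term with that prefix and stop once terms no longer share it. With an empty prefix, every term still has to be scanned.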

Anyway, caveat emptor for term enumeration with (Span)RegexQuery!
Doing term rotation at both indexing and search time can also
greatly reduce term enumeration, even with leading wildcards, but
I'll leave that as an exercise for the reader for now :)

	Erik
