Posted to java-user@lucene.apache.org by Paul Smith <ps...@aconex.com> on 2006/01/05 23:09:14 UTC
"Starts with" query?
I'm throwing myself at the mercy of the Lucene community; I'm a bit
brain dead today after looking after a screaming 3 month old baby for
4 hours last night...
We have a 'title' field indexed as Field.Text(...), which works
nicely, and has lots of good searching.
However, this application is being ported from a DB-based search,
which originally had a "starts with" type search, and we need to
support that.
So, given "The quick brown fox jumped of the caffeine addicted
software developer", if the user types "The quick*" we need to find
only those documents whose title starts with that text, and NOT match
documents that have "The quick" as a sequence of terms later in the
title (i.e. don't match "blah blah the quick blah blah").
Think SQL of " ....where title like 'The quick%' ".
How do I do that with Lucene? I'm sure this is a dumb question,
and I know that Lucene's searching is way more useful than that, but
you know these pesky compatibility requirements... It's screwing
with my unit tests because the new index search is returning
more results than the old DB method.
cheers,
Paul Smith
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: "Starts with" query?
Posted by Yonik Seeley <ys...@gmail.com>.
That's deprecated now of course... so you want MultiPhraseQuery.
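As a rough illustration of what a multi-phrase match with an expanded prefix amounts to, here is a minimal plain-Java model. It is not Lucene code: the class MultiPhraseModel, its methods and the toy analyzer are hypothetical, and it assumes the "0start0" start token discussed elsewhere in this thread. With the real class, I believe the start-token term would go to MultiPhraseQuery.add(Term) and the expanded terms to add(Term[]).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java model of a multi-phrase match; an illustration, not Lucene's implementation.
final class MultiPhraseModel {
    static final String START = "0start0"; // magic start token used in this thread

    // Stand-in analyzer (an assumption): lower-case, split on non-alphanumerics, prepend START.
    static List<String> analyze(String title) {
        List<String> tokens = new ArrayList<>();
        tokens.add(START);
        for (String t : title.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Expand a prefix such as "aut" into every indexed term that starts with it.
    // Assumes no term contains Character.MAX_VALUE.
    static Set<String> expand(NavigableSet<String> termDictionary, String prefix) {
        return new TreeSet<>(
                termDictionary.subSet(prefix, true, prefix + Character.MAX_VALUE, false));
    }

    // True if some run of consecutive tokens picks one allowed term per position.
    static boolean matches(List<String> tokens, List<Set<String>> positions) {
        for (int start = 0; start + positions.size() <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; ok && i < positions.size(); i++) {
                ok = positions.get(i).contains(tokens.get(start + i));
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] titles = {"auto update", "first auto update", "automatic update"};
        NavigableSet<String> dictionary = new TreeSet<>();
        for (String title : titles) dictionary.addAll(analyze(title));

        // "0start0 aut*": position 0 is the start token, position 1 is any term starting with "aut".
        List<Set<String>> query = new ArrayList<>();
        query.add(new TreeSet<>(Arrays.asList(START)));
        query.add(expand(dictionary, "aut"));

        for (String title : titles) {
            System.out.println(title + " -> " + matches(analyze(title), query));
        }
    }
}
```

The point of the model is that the wildcard is resolved against the term dictionary before matching, which is why a plain PhraseQuery containing "aut*" finds nothing.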
-Yonik
On 1/5/06, Yonik Seeley <ys...@gmail.com> wrote:
> Check out PhrasePrefixQuery.
>
> -Yonik
Re: "Starts with" query?
Posted by Yonik Seeley <ys...@gmail.com>.
Check out PhrasePrefixQuery.
-Yonik
On 1/5/06, Paul Smith <ps...@aconex.com> wrote:
> First off, in response to my own post: I meant PhraseQuery instead.
>
> But, since we're only tokenizing this field, and not storing the
> entire contents of the field, I'm not sure this is ever going to
> work, is it?
>
> I notice that if I have a title "auto update", then the phrase query
> trick works if it searches on
>
> title:"0start0 auto*"
>
> but does not find any matches for
>
> title:"0start0 aut*"
>
> I'm a bit stuck.
>
> Paul
Re: "Starts with" query?
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 6, 2006, at 7:00 AM, Erik Hatcher wrote:
> Notice that the query is "aut.*", not "aut*" such that it is a
> valid regular expression for what you want. In my current project,
> my custom query parser handles * and ? like WildcardQuery, but
> under the covers I simply convert that into a regex by replacing ?
> with . and * with .*
Let me add a major caveat, especially given that Paul's index is
large. (Span)RegexQuery by default, currently, scans through *every*
term in the index. This is due to the complexity in determining the
prefix of the regex. While it is obvious that "aut.*" should only
scan through terms starting with "aut", it gets more complicated with
expressions like "a?uto" because the "a" is optional. There is a
Jakarta Regexp implementation in contrib/regex also and it is capable
of determining the static prefix to reduce term enumeration, but I
suspect java.util.regex is much faster than Jakarta Regexp. I'm
using, in my project, a blending of the two letting Jakarta Regexp
determine the prefix but using java.util.regex for matching - this
requires a custom, and trivial, implementation of RegexCapabilities.
I didn't include that in contrib/regex because it seems a bit awkward
for general consumption.
Anyway, caveat emptor for term enumeration with (Span)RegexQuery!
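To make the prefix problem concrete, here is a minimal sketch of a deliberately conservative static-prefix extractor. It is not the Jakarta Regexp algorithm and not part of contrib/regex; the class and method names are hypothetical, and it simply gives up (returns an empty prefix) on anything it does not understand.

```java
// Simplified, conservative static-prefix extraction; not the Jakarta Regexp algorithm.
final class RegexStaticPrefix {
    private static final String META = ".*+?()[]{}\\^$";

    // Returns the literal characters every match must start with (possibly "").
    static String staticPrefix(String regex) {
        // With alternation ("ab|cd") there is no single required prefix, so give up.
        if (regex.indexOf('|') >= 0) return "";
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (META.indexOf(c) >= 0) {
                // '?', '*' and '{' can make the preceding literal optional, so drop it.
                if ((c == '?' || c == '*' || c == '{') && prefix.length() > 0) {
                    prefix.setLength(prefix.length() - 1);
                }
                break;
            }
            prefix.append(c);
        }
        return prefix.toString();
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"aut.*", "a?uto", "auto"}) {
            System.out.println(regex + " -> \"" + staticPrefix(regex) + "\"");
        }
    }
}
```

An empty prefix means every term still has to be scanned, which is the "a?uto" case described above; a non-empty one lets the enumeration start at that prefix.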
Also, doing term rotation at both indexing and search time can
greatly reduce term enumeration even with leading wildcards - but
I'll leave that as an exercise for the reader for now :)
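One possible reading of that exercise is a permuterm-style index, sketched below in plain Java. This is an assumption about what term rotation could look like, not code from Lucene or contrib/regex; the class name, the '$' end marker and the single-wildcard restriction are all hypothetical choices made for the illustration.

```java
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Permuterm-style rotation sketch; an assumption about what "term rotation" could look like.
final class TermRotation {
    static final char END = '$'; // hypothetical end marker, assumed absent from real terms

    // Maps every rotation of "term$" back to the original term.
    static NavigableMap<String, String> buildIndex(Iterable<String> terms) {
        NavigableMap<String, String> index = new TreeMap<>();
        for (String term : terms) {
            String marked = term + END;
            for (int i = 0; i < marked.length(); i++) {
                index.put(marked.substring(i) + marked.substring(0, i), term);
            }
        }
        return index;
    }

    // Rotates a pattern with exactly one '*' so the wildcard becomes trailing:
    // "X*Y" turns into the prefix "Y$X".
    static String rotatedPrefix(String pattern) {
        int star = pattern.indexOf('*');
        if (star < 0 || star != pattern.lastIndexOf('*')) {
            throw new IllegalArgumentException("expected exactly one '*': " + pattern);
        }
        return pattern.substring(star + 1) + END + pattern.substring(0, star);
    }

    // Answers the wildcard with a single range scan over the rotated terms.
    // Assumes no term contains Character.MAX_VALUE.
    static Set<String> search(NavigableMap<String, String> index, String pattern) {
        String prefix = rotatedPrefix(pattern);
        return new TreeSet<>(
                index.subMap(prefix, true, prefix + Character.MAX_VALUE, false).values());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> index =
                buildIndex(Arrays.asList("auto", "automatic", "photo", "update"));
        System.out.println(search(index, "*to"));  // leading wildcard
        System.out.println(search(index, "aut*")); // trailing wildcard
    }
}
```

The trade-off is index size: every term contributes one rotated entry per character, plus one for the marker.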
Erik
Re: "Starts with" query?
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 5, 2006, at 7:01 PM, Paul Smith wrote:
> First off, in response to my own post: I meant PhraseQuery instead.
>
> But, since we're only tokenizing this field, and not storing the
> entire contents of the field, I'm not sure this is ever going to
> work, is it?
Sure it will :)
> I notice that if I have a title "auto update", then the phrase
> query trick works if it searches on
>
> title:"0start0 auto*"
>
> but does not find any matches for
>
> title:"0start0 aut*"
>
> I'm a bit stuck.
PhraseQuery does not handle wildcards. Unfortunately this is a common
misunderstanding.
The MultiPhraseQuery could do this provided you expand "aut*" into
all the matching terms yourself. But here is an alternative using
the new SpanRegexQuery (in contrib/regex):
    // Index two documents; only the first has "auto" as its first term.
    RAMDirectory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("field", "auto update", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    doc = new Document();
    doc.add(new Field("field", "first auto update", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

    // SpanFirstQuery(srq, 1) only accepts regex matches that end at or
    // before position 1, i.e. matches on the first term of the field.
    IndexSearcher searcher = new IndexSearcher(directory);
    SpanRegexQuery srq = new SpanRegexQuery(new Term("field", "aut.*"));
    SpanFirstQuery sfq = new SpanFirstQuery(srq, 1);
    Hits hits = searcher.search(sfq);
    assertEquals(1, hits.length()); // only "auto update" matches
Notice that the query is "aut.*", not "aut*", so that it is a valid
regular expression for what you want. In my current project, my
custom query parser handles * and ? like WildcardQuery, but under the
covers I simply convert that into a regex by replacing ? with . and *
with .*
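A minimal sketch of that conversion in plain Java follows. The class and method names are hypothetical, and the escaping of other regex metacharacters is an added assumption (the description above only mentions replacing ? and *); the output targets java.util.regex syntax.

```java
import java.util.regex.Pattern;

// Sketch of a wildcard-to-regex conversion; escaping of metacharacters is an added assumption.
final class WildcardToRegex {
    private static final String META = "\\.[]{}()+^$|";

    // '*' becomes ".*", '?' becomes ".", and regex metacharacters are escaped.
    static String toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                if (META.indexOf(c) >= 0) regex.append('\\');
                regex.append(c);
            }
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String regex = toRegex("aut*");
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "auto"));
    }
}
```

Without the escaping step, a user typing "a.b*" would silently get a regex in which the dot matches any character.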
Erik
Re: "Starts with" query?
Posted by Paul Smith <ps...@aconex.com>.
First off, in response to my own post: I meant PhraseQuery instead.
But, since we're only tokenizing this field, and not storing the
entire contents of the field, I'm not sure this is ever going to
work, is it?
I notice that if I have a title "auto update", then the phrase query
trick works if it searches on
title:"0start0 auto*"
but does not find any matches for
title:"0start0 aut*"
I'm a bit stuck.
Paul
Re: "Starts with" query?
Posted by Paul Smith <ps...@aconex.com>.
> 2) index a magic token at the start of the title and include that in a
> phrase query:
> "_START_ the quick"
Ok, I've gone and chosen "0start0" as my start token, because our
analyzer is stripping _.
Now, second dumb question of the day: given the search for starts with
"The qui*", does that have to be turned into a prefix query, like so?
new PrefixQuery(new Term("title", "0start0 " + "The qui"))
Is that the right approach? To always prefix the search term string
with the magic start text?
I ask this because I'm getting weird results in my search now, as all
documents are being matched. When the query is finally run, it looks
like this:
+(orgid:7 publicflag:1 sharedorgid:7) +isregistered:1 +title:'0start0 f'* +cversion:1
(you can ignore all but the title field in this case, the rest is
correct for our app).
Paul
Re: "Starts with" query?
Posted by Paul Smith <ps...@aconex.com>.
> 1) also index the field untokenized and use a straight prefix query
See my reply to Chris, not sure I can afford the index size increment.
> 2) index a magic token at the start of the title and include that in a
> phrase query:
> "_START_ the quick"
ooooh, that's clever.
> 3) use a SpanFirst query (but you have to make the Java Query
> object yourself)
Will SpanFirst find the phrase at the start, but where the term might
not be complete, i.e. the search would be "The qu*"?
The _START_ trick might do it, and it's efficient in terms of index size etc.
Paul
Re: "Starts with" query?
Posted by Yonik Seeley <ys...@gmail.com>.
Off the top of my head:
1) also index the field untokenized and use a straight prefix query
2) index a magic token at the start of the title and include that in a
phrase query:
"_START_ the quick"
3) use a SpanFirst query (but you have to make the Java Query object yourself)
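To illustrate why option 1 works, here is a minimal plain-Java model of a prefix lookup over untokenized titles. It is not Lucene code: a TreeSet stands in for the sorted term dictionary, the class and method names are hypothetical, and lower-casing at index and query time is an assumption about the desired case handling.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

// Model of a prefix query over untokenized titles; a TreeSet stands in for the term dictionary.
final class UntokenizedPrefix {
    // Each whole title becomes a single lower-cased term.
    static NavigableSet<String> index(Iterable<String> titles) {
        NavigableSet<String> terms = new TreeSet<>();
        for (String title : titles) terms.add(title.toLowerCase(Locale.ROOT));
        return terms;
    }

    // All terms that start with the prefix, found with one range scan.
    // Assumes no title contains Character.MAX_VALUE.
    static SortedSet<String> startsWith(NavigableSet<String> terms, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        return terms.subSet(p, true, p + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = index(Arrays.asList(
                "The quick brown fox",
                "blah blah the quick blah blah",
                "The quiet one"));
        System.out.println(startsWith(terms, "The quick"));
        System.out.println(startsWith(terms, "The qui"));
    }
}
```

Because the whole title is one term, a partial last word such as "The qui" needs no special handling, at the cost of one long term per document.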
-Yonik
Re: "Starts with" query?
Posted by Paul Smith <ps...@aconex.com>.
>
> one thing you may not have thought about yet that may affect your
> decision: sorting in lucene requires the field be indexed but
> untokenized. so if you want to support sorting on the conceptual
> "title", you'll still need a version of your title field that's
> untokenized, which can then be used for these types of queries for
> free.
>
> (it's the kind of thing people sometimes don't realize until late in
> their development, they make sure all of their queries return the
> results they expect before they worry about what kinds of sorting
> they need to support)
Thanks for that note, we had decided that this was a field we were
just not going to be able to support sorting on, for both index size
and search/sort performance.
Paul
Re: "Starts with" query?
Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks Chris, I had thought of that one, but unfortunately the title
: could be quite long, and there are literally millions of documents.
: Isn't each title going to be included as one "term" in the index
: dictionary? If so, won't the index get ridiculously large and slow?
It depends on your definitions of "large" and "slow" ... I haven't had any
complaints, but i haven't really tried changing my implementation to see if
i can improve on it.
one thing you may not have thought about yet that may affect your
decision: sorting in lucene requires the field be indexed but untokenized.
so if you want to support sorting on the conceptual "title", you'll still
need a version of your title field that's untokenized, which can then be
used for these types of queries for free.
(it's the kind of thing people sometimes don't realize until late in
their development, they make sure all of their queries return the results
they expect before they worry about what kinds of sorting they need to
support)
-Hoss
Re: "Starts with" query?
Posted by Paul Smith <ps...@aconex.com>.
On 06/01/2006, at 9:33 AM, Chris Hostetter wrote:
>
>
> : Think SQL of " ....where title like 'The quick%' ".
>
> I solved this problem by having a variation of my field that was not
> tokenized, and did PrefixQueries on that field (so in your case, leave
> your title field alone for generic matches, and have a
> titleUntokenized field for PrefixMatches).
Thanks Chris, I had thought of that one, but unfortunately the title
could be quite long, and there are literally millions of documents.
Isn't each title going to be included as one "term" in the index
dictionary? If so, won't the index get ridiculously large and slow?
>
> Another approach i have not tried that should work as long as your
> "starts with" input is always whole Terms (ie: "The quick %" and not
> "The qu%") is to use a SpanFirstQuery wrapped around a SpanNearQuery
> containing successive SpanTermQueries with no slop.
ooh err, Span queries, I'd better go have a look at them.
thanks,
Paul
Re: "Starts with" query?
Posted by Chris Hostetter <ho...@fucit.org>.
: Think SQL of " ....where title like 'The quick%' ".
I solved this problem by having a variation of my field that was not
tokenized, and did PrefixQueries on that field (so in your case, leave
your title field alone for generic matches, and have a titleUntokenized
field for PrefixMatches).
Another approach i have not tried that should work as long as your "starts
with" input is always whole Terms (ie: "The quick %" and not "The qu%")
is to use a SpanFirstQuery wrapped around a SpanNearQuery containing
successive SpanTermQueries with no slop.
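A rough plain-Java model of what that span combination would check is sketched below. It is not Lucene code: the class and method names are hypothetical, token lists stand in for indexed positions, and the "end" parameter mirrors my understanding that SpanFirstQuery bounds the position at which a match may end.

```java
import java.util.Arrays;
import java.util.List;

// Model of SpanFirst wrapped around an in-order, zero-slop SpanNear; not Lucene code.
final class SpanFirstModel {
    // True if `phrase` occurs as consecutive whole tokens and that
    // occurrence ends at or before position `end`.
    static boolean spanFirstNear(List<String> tokens, List<String> phrase, int end) {
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            boolean sameTerms = tokens.subList(start, start + phrase.size()).equals(phrase);
            if (sameTerms && start + phrase.size() <= end) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("the", "quick");
        List<String> atStart = Arrays.asList("the", "quick", "brown", "fox");
        List<String> later = Arrays.asList("blah", "blah", "the", "quick", "blah", "blah");
        System.out.println(spanFirstNear(atStart, phrase, phrase.size()));
        System.out.println(spanFirstNear(later, phrase, phrase.size()));
    }
}
```

Setting end to the number of query terms pins the phrase to the very start of the field; a partial last word such as "qu" never equals a whole token, which is the whole-Terms limitation noted above.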
-Hoss