Posted to java-user@lucene.apache.org by "Storey, Jeff" <je...@dac.us> on 2006/11/11 19:06:10 UTC

Partial Word Matches

Hi. I'm using Lucene to do some searching (using the Searcher object and
passing it a ParsedQuery). I search for a word such as "long" and it is
returning partial matches, such as "belong" and "along." Is there a way
to turn off this behavior and only match whole words?

 

Thank you,

Jeff

 


Re: Partial Word Matches

Posted by Erick Erickson <er...@gmail.com>.
I *knew* it wasn't just a random thing, solely intended to confuse me
<G>....

Thanks for that..

Erick


Re: Partial Word Matches

Posted by Chris Hostetter <ho...@fucit.org>.
: Well, first ~ isn't a wildcard, it's a "fuzzy search" (aka similar terms).
: So getting "bellow" for "yellow~" is expected. Although, somewhat
: confusingly, "lemon orange"~10 is a proximity search.

FYI: the meme there is that in the QueryParser ~ denotes something is
sloppy or "loose" ... so  yellow~0.8  is a loose match on the word
"yellow" (with some character insertions/reordering/deletions allowed) and
"lemon orange"~10  is a loose match on the phrase "lemon orange" (with
some word insertions/reordering allowed)
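
As a rough illustration of the single-term case: Lucene's fuzzy matching is based on edit distance (Levenshtein), and "bellow" is one substitution away from "yellow". The sketch below is plain Java, not Lucene code. The class name EditDistanceDemo and the similarity() normalization are assumptions made for illustration; Lucene's real scoring may differ in detail.

```java
final class EditDistanceDemo {
    private EditDistanceDemo() {}

    /** Classic Levenshtein distance: fewest single-character inserts, deletes, or substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization for illustration only: 1 - distance / length of the shorter word. */
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) distance(a, b) / shorter;
    }

    public static void main(String[] args) {
        System.out.println(distance("yellow", "bellow")); // prints 1
        System.out.println(similarity("yellow", "bellow"));
    }
}
```

Under that assumed normalization, "yellow" and "bellow" score about 0.83, which would clear a threshold such as the 0.8 in yellow~0.8.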




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Partial Word Matches

Posted by "Storey, Jeff" <je...@dac.us>.
Erick,

Very useful answers -- I'll be reading up more with the links you've
provided.

Thanks.
Jeff


Re: Partial Word Matches

Posted by Erick Erickson <er...@gmail.com>.
See below.....

On 11/11/06, Storey, Jeff <je...@dac.us> wrote:
>
> I stand corrected -- I am NOT getting partial matches, I was extracting
> partial matches from the text programmatically and thought that's what
> was being returned.
>
> On another topic, regarding Boolean queries and wildcard queries, I have
> two questions:
>
> It seems like when I enter the query "ball AND basket" it returns
> different results than "ball and basket." Is there a way to make the
> Boolean operators case insensitive?


No. I've found this kind of odd too. Here's a snippet...

Boolean operators allow terms to be combined through logic operators. Lucene
supports AND, "+", OR, NOT and "-" as Boolean operators (Note: Boolean
operators must be ALL CAPS).

from http://sewm.pku.edu.cn/src/clucene/doc/queryparsersyntax.html
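
If lower-case operators need to be accepted anyway, one possible workaround is to upper-case them in the query string before it reaches the QueryParser. The sketch below is plain Java and only an assumption about how that could be done: OperatorNormalizer is a hypothetical helper, not part of Lucene. It only handles whitespace-separated and/or/not outside double-quoted phrases, and it does not understand backslash escapes or operators glued to parentheses.

```java
import java.util.Locale;

final class OperatorNormalizer {
    private OperatorNormalizer() {}

    /** Upper-cases bare and/or/not tokens that appear outside double-quoted phrases. */
    static String normalize(String query) {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '"') {
                flush(word, out, inQuotes); // the pending word belongs to the current quote state
                inQuotes = !inQuotes;
                out.append(c);
            } else if (Character.isWhitespace(c)) {
                flush(word, out, inQuotes);
                out.append(c);
            } else {
                word.append(c);
            }
        }
        flush(word, out, inQuotes);
        return out.toString();
    }

    /** Appends the pending word to the output, upper-casing it if it is an unquoted operator. */
    private static void flush(StringBuilder word, StringBuilder out, boolean inQuotes) {
        String w = word.toString();
        word.setLength(0);
        boolean isOperator = w.equalsIgnoreCase("and")
                || w.equalsIgnoreCase("or")
                || w.equalsIgnoreCase("not");
        out.append(!inQuotes && isOperator ? w.toUpperCase(Locale.ROOT) : w);
    }

    public static void main(String[] args) {
        System.out.println(normalize("ball and basket")); // prints ball AND basket
    }
}
```

The trade-off is that the words "and", "or" and "not" can then only be searched for literally inside a quoted phrase.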



> As for wildcard searches, one of the things I attempt to do is pick out
> the words that caused a particular file to be returned. However, when I
> search for the term "yellow~" I might get something like "bellow." Is
> there a way to list what Lucene found in the document that made it
> relevant?


Well, first ~ isn't a wildcard, it's a "fuzzy search" (aka similar terms).
So getting "bellow" for "yellow~" is expected. Although, somewhat
confusingly, "lemon orange"~10 is a proximity search.

* and ? are the wildcard characters.
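
To make the difference from ~ concrete, here is a small plain-Java sketch of wildcard semantics, where ? stands for exactly one character and * for any run of characters (including none). WildcardDemo is a hypothetical name, and translating the pattern to a regular expression is just a convenient way to show the idea, not how Lucene evaluates wildcard queries internally.

```java
import java.util.regex.Pattern;

final class WildcardDemo {
    private WildcardDemo() {}

    /** Translates a wildcard pattern into a regex: * = any run of characters, ? = exactly one. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote everything else so regex metacharacters are matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** True if the whole term matches the wildcard pattern. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("te?t", "test"));    // prints true
        System.out.println(matches("long*", "belong")); // prints false
    }
}
```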

Searcher.explain() is probably your friend. I haven't used it much, but
I've certainly seen it mentioned enough.

If you haven't, get a copy of Luke (google lucene luke). It's a program that
allows you to explore your indexes, explain queries, explore the effects of
different analyzers, etc. Really, really, really get a copy <G>....



RE: Partial Word Matches

Posted by "Storey, Jeff" <je...@dac.us>.
For proprietary reasons, I cannot post code samples, but I can give you
more details as to what I am doing. I am basically trying to search a
directory of text files.

Step 1: Create an IndexWriter for the directory being searched.

Step 2: For each text file, create a new Document object and add the
document title and content as fields to this Lucene document. Add these
documents to the IndexWriter.

Step 3: Create a QueryParser and parse a user-entered query.

Step 4: Create an IndexSearcher to search the directory created by the
IndexWriter.

Step 5: Use the search method of the IndexSearcher to search the parsed
query created in Step 3.

That's it. Is this the proper way to be doing searching?
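
The five steps above can be mimicked in plain Java, with no Lucene classes at all, to show why an index built from whole terms does not return partial matches. This is a toy stand-in under stated assumptions: ToyIndex, analyze(), addDocument() and search() are hypothetical names, the "analyzer" is a crude split on non-alphanumeric characters, and only single-word queries are handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToyIndex {
    // term -> names of the documents that contain that exact term
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Lower-cases and splits on anything that is not a letter or digit. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Steps 1-2: record every term of one document in the index. */
    void addDocument(String name, String content) {
        for (String term : analyze(content)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(name);
        }
    }

    /** Steps 3-5: analyze a single-word query the same way and look up the exact term. */
    Set<String> search(String word) {
        List<String> terms = analyze(word);
        if (terms.isEmpty()) {
            return new TreeSet<String>();
        }
        Set<String> hits = postings.get(terms.get(0));
        return hits == null ? new TreeSet<String>() : new TreeSet<String>(hits);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("a.txt", "It was a long day.");
        index.addDocument("b.txt", "These belong along the wall.");
        System.out.println(index.search("long")); // prints [a.txt]
    }
}
```

Because the lookup is on the whole term "long", the document containing only "belong" and "along" is not returned.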

Thanks.
Jeff


RE: Partial Word Matches

Posted by "Storey, Jeff" <je...@dac.us>.
I stand corrected -- I am NOT getting partial matches, I was extracting
partial matches from the text programmatically and thought that's what
was being returned.
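
To illustrate the kind of mix-up described here, a hypothetical plain-Java sketch follows (it is not the poster's actual code, which was not shared). It contrasts a naive substring scan, which reports hits inside "belong" and "along", with a word-boundary regex that reports only the standalone word. WholeWordDemo and its method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeWordDemo {
    private WholeWordDemo() {}

    /** Naive scan: start offsets of every occurrence, including those inside longer words. */
    static List<Integer> substringOffsets(String text, String word) {
        List<Integer> offsets = new ArrayList<>();
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = word.toLowerCase(Locale.ROOT);
        if (needle.isEmpty()) {
            return offsets;
        }
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            offsets.add(from);
            from++;
        }
        return offsets;
    }

    /** Start offsets where the word stands alone, using \b word boundaries and ignoring case. */
    static List<Integer> wholeWordOffsets(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        List<Integer> offsets = new ArrayList<>();
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "They belong along a long road.";
        System.out.println(substringOffsets(text, "long")); // prints [7, 13, 20]
        System.out.println(wholeWordOffsets(text, "long")); // prints [20]
    }
}
```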

On another topic, regarding Boolean queries and wildcard queries, I have
two questions:

It seems like when I enter the query "ball AND basket" it returns
different results than "ball and basket." Is there a way to make the
Boolean operators case insensitive?

As for wildcard searches, one of the things I attempt to do is pick out
the words that caused a particular file to be returned. However, when I
search for the term "yellow~" I might get something like "bellow." Is
there a way to list what Lucene found in the document that made it
relevant?

Thanks for all the help.

Jeff


Re: Partial Word Matches

Posted by Paul Borgermans <pa...@gmail.com>.
Indeed, as far as I know Lucene, the only way this can happen is by using a
stemmer during indexing; the standard analyzer won't result in such
behaviour.

hth

Paul



-- 
http://walhalla.wordpress.com

Re: Partial Word Matches

Posted by Erick Erickson <er...@gmail.com>.
That's not the default behavior, so I'm perplexed. Normally, you have to go
to considerable effort to get partial matches....

What analyzers are you using at both index and query time? Posting as short
a code snippet as you can make that shows this behavior would be a good
thing. I flat guarantee folks will look at it. But please make it short
<G>.

Best
Erick
