You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Walt Stoneburner <wa...@gmail.com> on 2007/04/17 17:29:24 UTC

Mixing Case and Case-Insensitive Searching

I've run into a case where we want to search for the acronym 'LET',
however this three letter word occurs very frequently in quite a
number of documents.

What I'm looking to do is a query that's case insensitive _except_ for
that specific term.

And, it appears to do so, things get very ugly, very quickly.

According to http://www.gossamer-threads.com/lists/lucene/java-user/28131?page=last
(I'm doing my homework first), it appears that one must keep a
case-sensitive and case-insensitive version in the index, if not two
separate indexes.

While this makes sense up to a point, allowing me to search one field
or the other; I'm looking for distances between words, it seems things
get far more complicated.

Because, should I store both versions and happen to know that the
tokens are the same and that their positions are identical, is there a
way do things like:

"LET organization"  (where LET is case sensitive, but part of the phrase)

"company LET"~10  (again, where LET is case sensitive, near the term
company which is case insensitive)

Would love to get some thoughts on how to go about approaching this.

Thanks,
-Walt Stoneburner

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Mixing Case and Case-Insensitive Searching

Posted by Erick Erickson <er...@gmail.com>.
Would it work to index the upper-case LET as something else? For
instance, index it as '$let'

Now, all your searches are on one index, but you have to substitute
'$let' where you want to find LET in your query.. This won't match your
other occurrences of let...

You'll have to watch to be sure your displays are appropriate. That is,
depending upon how you store your data for display you'd have to
translate '$let' back into 'LET'......

This should also work for span queries, etc.....

This won't foul  you up if you use wildcards as long as you don't
allow the wildcards to appear in the first character position...

Best
Erick

On 4/17/07, Walt Stoneburner <wa...@gmail.com> wrote:
>
> I've run into a case where we want to search for the acronym 'LET',
> however this three letter word occurs very frequently in quite a
> number of documents.
>
> What I'm looking to do is a query that's case insensitive _except_ for
> that specific term.
>
> And, it appears to do so, things get very ugly, very quickly.
>
> According to
> http://www.gossamer-threads.com/lists/lucene/java-user/28131?page=last
> (I'm doing my homework first), it appears that one must keep a
> case-sensitive and case-insensitive version in the index, if not two
> separate indexes.
>
> While this makes sense up to a point, allowing me to search one field
> or the other; I'm looking for distances between words, it seems things
> get far more complicated.
>
> Because, should I store both versions and happen to know that the
> tokens are the same and that their positions are identical, is there a
> way do things like:
>
> "LET organization"  (where LET is case sensitive, but part of the phrase)
>
> "company LET"~10  (again, where LET is case sensitive, near the term
> company which is case insensitive)
>
> Would love to get some thoughts on how to go about approaching this.
>
> Thanks,
> -Walt Stoneburner
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Mixing Case and Case-Insensitive Searching

Posted by Erick Erickson <er...@gmail.com>.
Yeah, what Hoss said. That's a much more elegant solution
than I suggested. If you use the same filter for indexing and searching,
it'll all "just happen" for you.

Erick

On 4/17/07, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> : I've run into a case where we want to search for the acronym 'LET',
> : however this three letter word occurs very frequently in quite a
> : number of documents.
> :
> : What I'm looking to do is a query that's case insensitive _except_ for
> : that specific term.
>
> it sounds like you need to create a customized bastard stepshild of
> StopFilter and LowercaseFilter ... take in a dictionary of known
> capitalized acronimes in the constructor, then for each Token lowercase it
> unless:
>    - it's in all caps
>    - it's in your acronym list.
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Mixing Case and Case-Insensitive Searching

Posted by Chris Hostetter <ho...@fucit.org>.
: I've run into a case where we want to search for the acronym 'LET',
: however this three letter word occurs very frequently in quite a
: number of documents.
:
: What I'm looking to do is a query that's case insensitive _except_ for
: that specific term.

it sounds like you need to create a customized bastard stepshild of
StopFilter and LowercaseFilter ... take in a dictionary of known
capitalized acronimes in the constructor, then for each Token lowercase it
unless:
   - it's in all caps
   - it's in your acronym list.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org