Posted to java-user@lucene.apache.org by John Byrne <jo...@propylon.com> on 2008/06/25 11:37:19 UTC

case insensitivity

Hi,

I know that case-insensitive searching is normally done by creating an 
all-lower-case version of the documents, and turning the search terms 
into lower case whenever this field is searched, but this approach has 
its disadvantages.

Let's say, for example, you want to find "Dell" (with a capital "D"), 
near "computers" (with or without capitals, i.e. in any case). The 
problem is that you would need to use a SpanQuery to find terms near 
each other; but if the case-sensitivity required is different for each 
term, then they will be in different fields, making the use of 
SpanQuerys impossible.

There might be ways to work around this, but my question is: will 
case insensitivity ever be added to Lucene as a per-Term option? If not, 
can anyone tell me where I should start looking in order to make this 
change myself?

Thanks!

-JB



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: case insensitivity

Posted by Erick Erickson <er...@gmail.com>.

I suppose something like that might work, but I still think that presenting
a user with matches that sometimes work case sensitive and sometimes
don't would be...er..fraught.


If you can programmatically restrict your query construction and you're
*sure* this is what your users expect, you can make it work. Just index
each term twice, once lowercased and once in the native case, with a
position increment of 0 between them. Then you can simply construct your
terms however you want and fire the result at the search. In fact, you
only need to double-index the terms you want to do case-sensitive
searches on. This will increase the size of your index less than you
think...
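
To make the double-indexing idea concrete, here is a minimal sketch in plain
Java. It is not Lucene code: the class DualCaseIndex, the whitespace
tokenization and the single-document index are all assumptions made for
illustration. In Lucene itself the second token would come from a custom
token filter that emits it with a position increment of 0.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy positional index for one document; illustrative only, not Lucene code. */
final class DualCaseIndex {
    // term text -> positions where that term occurs
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Splits on whitespace and stores each token in its native case and lowercased. */
    void add(String text) {
        String[] tokens = text.trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            String original = tokens[pos];
            String lower = original.toLowerCase(Locale.ROOT);
            record(original, pos);
            if (!lower.equals(original)) {
                // Same position as the original token: the "increment of 0".
                record(lower, pos);
            }
        }
    }

    private void record(String term, int pos) {
        postings.computeIfAbsent(term, k -> new ArrayList<>()).add(pos);
    }

    /** True if exactTerm (case as given) occurs within maxDistance positions of anyCaseTerm. */
    boolean near(String exactTerm, String anyCaseTerm, int maxDistance) {
        List<Integer> exact = postings.getOrDefault(exactTerm, Collections.emptyList());
        List<Integer> folded = postings.getOrDefault(
                anyCaseTerm.toLowerCase(Locale.ROOT), Collections.emptyList());
        for (int a : exact) {
            for (int b : folded) {
                if (a != b && Math.abs(a - b) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        DualCaseIndex index = new DualCaseIndex();
        index.add("We bought Dell COMPUTERS last year");
        System.out.println(index.near("Dell", "computers", 1)); // true
        System.out.println(index.near("DELL", "computers", 1)); // false
    }
}
```

Because both forms live in one field at the same position, a proximity check
can mix an exact-case term with a folded one. One limitation of this toy
version: an exact-case search for an all-lowercase term such as "dell" also
matches the folded copy of "Dell", since the two are indistinguishable.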

Best
Erick

On Wed, Jun 25, 2008 at 9:59 AM, John Byrne <jo...@propylon.com> wrote:

> What I had in mind was actually very simple: when you create a Term
> (programmatically) you normally set the text and the field. I would also like
> to be able to set the case sensitivity to true or false for that specific
> Term object.
>
> I imagined (and maybe I am oversimplifying it!) that somewhere in the API
> there must be a string comparison using 'String.equals()' that determines if
> a document contains the term or not - and that use of 'equals()' has
> permanently locked Lucene into case-sensitive searching. The values being
> compared could be first lower-cased (or equalsIgnoreCase could be used)
> depending on the value of a boolean flag in the Term object.
>
> If that option was there, there would be no need to ever change the case in
> the analyzer - you'd be able to control case-sensitivity regardless of the
> field used.
>
> Of course, I realize that there is currently no way to take advantage of
> such a feature in the QueryParser. It could only be done programmatically.
> But I don't think that's a reason not to do it, since the API already has
> features that aren't implemented in the QueryParser (like SpanQuerys). In a
> perfect world, the parser would support all the features, but for the time
> being anyone who wants to take advantage of the newer features has to find
> an alternative anyway.
>
> The problem that it would solve for me is, as I mentioned, that I could mix
> case-sensitive Terms with case-insensitive Terms when using SpanQuerys. I
> currently have no way to do that.
>
> Regards,
> -John

Re: case insensitivity

Posted by John Byrne <jo...@propylon.com>.

Chris Hostetter wrote:
> the enumeration is in lexicographical order, so "Dell" is nowhere near 
> "dell" in the enumeration.  even if we added a boolean property to Terms 
> indicating that it's a case-insensitive Term, the "seeking" along that 
> enumeration would be ... less optimal ... than it can be now.
>   
Ah, now I understand!
> : > > Let's say, for example, you want to find "Dell" (with a capital "D"), near
> : > > "computers" (with or without capitals, i.e. in any case). The problem is
> : > > that you would need to use a SpanQuery to find terms near each other; but if
> : > > the case-sensitivity required is different for each term, then they will be in
> : > > different fields, making the use of SpanQuerys impossible.
>
> i assume by this statement that you are suggesting that you want your
> users to be able to say "find me $foo near $bar where $foo must be in the
> case i specified but bar can be in any case" is that correct?
>   
Yes, that's exactly what I meant.
> in that case Erick's point about indexing both the original case and 
> some normalized casing at the same term position is the best way to go -- 
> the only downside this has compared to separate fields is that it can 
> introduce some bias in your tf/idf values ... but that can be eliminated 
> by prefixing all of your "normalized" terms with some unicode character 
> that your tokenizer would normally strip off.
>
>   
From Erick's reply:

"I suppose something like that might work, but I still think that presenting
a user with matches that sometimes work case sensitive and sometimes
don't would be...er..fraught."

The user would, of course, choose which terms are case-sensitive when 
they query, using a modifier in the query language. (I would have to 
implement that). It's something my users have asked to be able to do -  
in their view, fields are something that should be used for different 
content, and case-sensitivity should be an option on *any* field. But 
what you have suggested should allow it to work that way, by adding both 
versions of the term at the same position.

Thanks guys!

-John


Re: case insensitivity

Posted by Chris Hostetter <ho...@fucit.org>.

: I imagined (and maybe I am oversimplifying it!) that somewhere in the API
: there must be a string comparison using 'String.equals()' that determines if a
: document contains the term or not - and that use of 'equals()' has permanently
: locked Lucene into case-sensitive searching. The values being compared could
: be first lower-cased (or equalsIgnoreCase could be used) depending on the
: value of a boolean flag in the Term object.

You are oversimplifying it a bit ... string comparisons are done in the 
internals, but not to compare a query's terms to a document's terms ... 
the index is inverted, so there is a single enumeration of all indexed 
terms (regardless of which documents they are in) which maintains pointers 
to the docs that contain them.  querying involves seeking along that 
enumeration to find the indexed term that corresponds to the query term.

the enumeration is in lexicographical order, so "Dell" is nowhere near 
"dell" in the enumeration.  even if we added a boolean property to Terms 
indicating that it's a case-insensitive Term, the "seeking" along that 
enumeration would be ... less optimal ... than it can be now.
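
a rough illustration of the ordering point, as a sketch only: the class
SortedTermsDemo and its methods are hypothetical, and a java.util.TreeSet of
Strings stands in for the real term enumeration. it shows that an exact
lookup is a single seek, while a case-insensitive lookup has to visit terms
that are scattered across the sorted order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy sorted term dictionary; illustrative only, not Lucene's term enumeration. */
final class SortedTermsDemo {
    /** Builds a dictionary sorted by String's natural (code unit) order. */
    static TreeSet<String> dictionary(String... terms) {
        return new TreeSet<>(Arrays.asList(terms));
    }

    /** Exact lookup: one seek to the first term >= the query term. */
    static boolean seekExact(TreeSet<String> dict, String term) {
        return term.equals(dict.ceiling(term));
    }

    /** Case-insensitive lookup: the variants are scattered, so every term is visited. */
    static List<String> scanIgnoreCase(TreeSet<String> dict, String term) {
        List<String> hits = new ArrayList<>();
        for (String candidate : dict) {
            if (candidate.equalsIgnoreCase(term)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> dict = dictionary("dell", "apple", "Dell", "computers", "Zebra");
        System.out.println(dict);                         // [Dell, Zebra, apple, computers, dell]
        System.out.println(seekExact(dict, "Dell"));      // true
        System.out.println(scanIgnoreCase(dict, "DELL")); // [Dell, dell]
    }
}
```

in the sorted output every other term sits between "Dell" and "dell", which
is why a per-Term "ignore case" flag could not be served by one seek.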

: > > Let's say, for example, you want to find "Dell" (with a capital "D"), near
: > > "computers" (with or without capitals, i.e. in any case). The problem is
: > > that you would need to use a SpanQuery to find terms near each other; but if
: > > the case-sensitivity required is different for each term, then they will be in
: > > different fields, making the use of SpanQuerys impossible.

i assume by this statement that you are suggesting that you want your
users to be able to say "find me $foo near $bar where $foo must be in the
case i specified but bar can be in any case" is that correct?

in that case Erick's point about indexing both the original case and 
some normalized casing at the same term position is the best way to go -- 
the only downside this has compared to separate fields is that it can 
introduce some bias in your tf/idf values ... but that can be eliminated 
by prefixing all of your "normalized" terms with some unicode character 
that your tokenizer would normally strip off.
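
a small sketch of the prefixing idea, under stated assumptions: the class
MarkedFoldedTerms, the whitespace tokenization and the choice of U+0001 as
the marker are all hypothetical, and plain document counts stand in for
real tf/idf statistics. the point is only that marked, normalized terms can
never collide with original-case terms, so each kind keeps its own counts.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy index that keeps original-case and normalized terms apart; not Lucene code. */
final class MarkedFoldedTerms {
    // Hypothetical marker; any character the tokenizer never emits would do.
    static final char MARKER = '\u0001';

    // term text -> ids of the documents that contain it
    private final Map<String, Set<Integer>> docsByTerm = new HashMap<>();

    /** Stores every token twice: as written, and lowercased behind the marker. */
    void add(int docId, String text) {
        for (String token : text.trim().split("\\s+")) {
            put(token, docId);
            put(MARKER + token.toLowerCase(Locale.ROOT), docId);
        }
    }

    private void put(String term, int docId) {
        docsByTerm.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
    }

    /** Number of documents containing the term in exactly this case. */
    int docFreqExact(String term) {
        return docsByTerm.getOrDefault(term, Collections.emptySet()).size();
    }

    /** Number of documents containing the term in any case. */
    int docFreqAnyCase(String term) {
        String key = MARKER + term.toLowerCase(Locale.ROOT);
        return docsByTerm.getOrDefault(key, Collections.emptySet()).size();
    }

    public static void main(String[] args) {
        MarkedFoldedTerms index = new MarkedFoldedTerms();
        index.add(1, "Dell computers");
        index.add(2, "a quiet dell");
        index.add(3, "DELL Computers");
        System.out.println(index.docFreqExact("dell"));   // 1
        System.out.println(index.docFreqAnyCase("dell")); // 3
    }
}
```

without the marker, the exact-case count for "dell" would absorb the folded
copies of "Dell" and "DELL" and come out as 3 instead of 1.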


-Hoss



Re: case insensitivity

Posted by John Byrne <jo...@propylon.com>.

What I had in mind was actually very simple: when you create a Term 
(programmatically) you normally set the text and the field. I would also 
like to be able to set the case sensitivity to true or false for that 
specific Term object.

I imagined (and maybe I am oversimplifying it!) that somewhere in the 
API there must be a string comparison using 'String.equals()' that 
determines if a document contains the term or not - and that use of 
'equals()' has permanently locked Lucene into case-sensitive searching. 
The values being compared could be first lower-cased (or 
equalsIgnoreCase could be used) depending on the value of a boolean flag 
in the Term object.

If that option was there, there would be no need to ever change the case 
in the analyzer - you'd be able to control case-sensitivity regardless 
of the field used.

Of course, I realize that there is currently no way to take advantage of 
such a feature in the QueryParser. It could only be done 
programmatically. But I don't think that's a reason not to do it, since 
the API already has features that aren't implemented in the QueryParser 
(like SpanQuerys). In a perfect world, the parser would support all the 
features, but for the time being anyone who wants to take advantage of 
the newer features has to find an alternative anyway.

The problem that it would solve for me is, as I mentioned, that I could 
mix case-sensitive Terms with case-insensitive Terms when using 
SpanQuerys. I currently have no way to do that.

Regards,
-John

Erick Erickson wrote:
> Well, it depends on what you mean by "per term". There's already
> PerFieldAnalyzerWrapper for each field, but I don't think that's what
> you want.
>
> How do you expect a per term analyzer to behave? I'm having a hard
> time thinking of a use case that's general. You could always
> roll your own analyzer that didn't change case for your particular
> list of words.
>
> But the problem is your users. In your example, suppose a user
> typed in "dell computers". Would that match "Dell computers"?
> Does your analyzer automatically upper-case some words? If it
> does, that's the same as lower casing them all. If it doesn't,
> how do you explain that to your users?
>
> All in all, I'm having a tough time imagining how this would work.
> It's easy enough to say "let's assume", but I suspect that
> whatever solution satisfied your example will have its own problems
> that are far worse than just lower-casing things.
>
> Best
> Erick

Re: case insensitivity

Posted by Erick Erickson <er...@gmail.com>.

Well, it depends on what you mean by "per term". There's already
PerFieldAnalyzerWrapper for each field, but I don't think that's what
you want.

How do you expect a per term analyzer to behave? I'm having a hard
time thinking of a use case that's general. You could always
roll your own analyzer that didn't change case for your particular
list of words.

But the problem is your users. In your example, suppose a user
typed in "dell computers". Would that match "Dell computers"?
Does your analyzer automatically upper-case some words? If it
does, that's the same as lower casing them all. If it doesn't,
how do you explain that to your users?

All in all, I'm having a tough time imagining how this would work.
It's easy enough to say "let's assume", but I suspect that
whatever solution satisfied your example will have its own problems
that are far worse than just lower-casing things.

Best
Erick
