Posted to solr-user@lucene.apache.org by Apurv Verma <ap...@bloomreach.com> on 2014/11/25 12:35:37 UTC

Case Insensitive Matching in Solr/Lucene

Hey all,
The standard solution for case-insensitive matching in Lucene is to
use a lowercase filter at index and query time. However, this does not
preserve the content of the original document. For example, suppose my
inverted index is:

Term    | Doc_1 | Doc_2
--------+-------+-------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
--------+-------+-------

Is it possible to choose between case-insensitive and case-sensitive matching
at query time? The index is held in memory in Solr. My question is: if it
is stored as a hash map with string keys, can I override the hash code so that
"Quick" and "quick" return the same hash value?

Has anyone attempted this before? Is my assumption about index right? What
would be the classes and code flow to look at?

-- 
Regards,
Apurv
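
To make the example concrete, here is a minimal Python sketch (not Lucene
code) that rebuilds the table above. The two document texts are an
assumption: they are not given in the post and were chosen only because they
produce exactly the listed terms. build_index is a hypothetical helper. The
sketch shows what the lowercase filter costs: "Quick" and "quick" (and "The"
and "the") collapse into single entries, so the original casing cannot be
recovered from the folded index.

```python
from collections import defaultdict

# Assumed document texts, chosen so the terms match the table in the post.
DOCS = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}

def build_index(docs, lowercase=False):
    """Map each whitespace-separated term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            term = token.lower() if lowercase else token
            index[term].add(doc_id)
    return dict(index)

cased = build_index(DOCS)                   # the index shown in the table
folded = build_index(DOCS, lowercase=True)  # what lowercasing at index time yields

print(sorted(cased))            # "Quick" and "quick" are separate terms
print(sorted(folded["quick"]))  # one folded entry covers both documents
```

With these assumed texts the cased index has 15 terms, as in the table, and
the folded index has 13.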

Re: Case Insensitive Matching in Solr/Lucene

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi,

CapitalizationFilterFactory could be useful to build nice looking facet parameters.

Ahmet

On Tuesday, November 25, 2014 3:28 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
The usual solution is to have faceting using the other field (with
copyField). Usually it is because people want the original unmodified
version of the string without tokenization (so, "United States of
America" instead of "united" "states" "america"). It sounds like your
case is a little different and you do want tokenized values, just not
lowercased.

In which case, I would copyField and do the different processing.
Also, in latest Solr, the recommendation is to use docValues for
fields used for faceting, so you can benefit from that speed-up as
well.

As to the different variants of the same token, some of the filters
have a preserveOriginal flag that will generate two forms. For example,
WordDelimiterFilterFactory:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilterFactory.html

There is also http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html
but it is not clear which subsequent filters actually take advantage of
this duplication.

And, of course, ngram filters generate multiple token substrings, all
in the same positions. Easy to see by using an analyzer chain that has
one and testing it in the Admin UI's Analyze screen with extended
information checkbox enabled.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On 25 November 2014 at 08:05, Apurv Verma <da...@gmail.com> wrote:
> Hey Michael,
>  Thanks for your reply. My use case is a little different. I would like to
> get the original values in facet queries but I would like to apply filter
> queries in a case insensitive fashion.
>
> For example  I require facet_query to return Quick, The, brown, ...
> But I want filter queries of the form fq=Term:"quick"
>
> Also could you please point me to some additional links on how I can index
> different variants of a token at the same position?
>
>
> --
> Regards,
> Apurv Verma

Re: Case Insensitive Matching in Solr/Lucene

Posted by Erick Erickson <er...@gmail.com>.

DocValues are restricted to certain types of untokenized fields,
specifically string, Trie* and UUID. So LowerCaseFilter is not
even in the picture.

Furthermore, changing to DocValues requires completely re-indexing, so....

Best,
Erick

On Tue, Nov 25, 2014 at 1:26 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 11/25/2014 6:27 AM, Alexandre Rafalovitch wrote:
>> The usual solution is to have faceting using the other field (with
>> copyField). Usually it is because people want the original unmodified
>> version of the string without tokenization (so, "United States of
>> America" instead of "united" "states" "america"). It sounds like your
>> case is a little different and you do want tokenized values, just not
>> lowercased.
>
> Something I've been wondering about related to facets.  This might be a
> tangent from the original issue, but it's somewhat related, so I'm
> asking it here.
>
> It's my understanding that DocValues have the same info as stored fields
> -- that is, the original value, completely unmodified by the analysis chain.
>
> It's also my understanding that DocValues get used for sorting and
> facets if they are present.
>
> If both of these assumptions/understandings are correct, then I would
> think that simply turning on DocValues for a field with the lowercase
> filter (and reindexing) would allow case-insensitive queries *plus*
> facets with the original unmodified and untokenized values.
>
> Have I got completely the wrong idea?  I haven't tested any of this.
>
> Thanks,
> Shawn
>

Re: Case Insensitive Matching in Solr/Lucene

Posted by Shawn Heisey <ap...@elyograg.org>.

On 11/25/2014 6:27 AM, Alexandre Rafalovitch wrote:
> The usual solution is to have faceting using the other field (with
> copyField). Usually it is because people want the original unmodified
> version of the string without tokenization (so, "United States of
> America" instead of "united" "states" "america"). It sounds like your
> case is a little different and you do want tokenized values, just not
> lowercased.

Something I've been wondering about related to facets.  This might be a
tangent from the original issue, but it's somewhat related, so I'm
asking it here.

It's my understanding that DocValues have the same info as stored fields
-- that is, the original value, completely unmodified by the analysis chain.

It's also my understanding that DocValues get used for sorting and
facets if they are present.

If both of these assumptions/understandings are correct, then I would
think that simply turning on DocValues for a field with the lowercase
filter (and reindexing) would allow case-insensitive queries *plus*
facets with the original unmodified and untokenized values.

Have I got completely the wrong idea?  I haven't tested any of this.

Thanks,
Shawn

Re: Case Insensitive Matching in Solr/Lucene

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

The usual solution is to have faceting using the other field (with
copyField). Usually it is because people want the original unmodified
version of the string without tokenization (so, "United States of
America" instead of "united" "states" "america"). It sounds like your
case is a little different and you do want tokenized values, just not
lowercased.

In which case, I would copyField and do the different processing.
Also, in latest Solr, the recommendation is to use docValues for
fields used for faceting, so you can benefit from that speed-up as
well.
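
The copyField suggestion above can be modelled in a few lines of Python.
This is only a sketch of the idea, not Solr code: the documents, the
term_lower copy, and the helpers filter_ids and facet_counts are all
hypothetical names. Matching runs against a lowercased copy of the field,
while facet values are read from the untouched original.

```python
from collections import Counter

# Hypothetical documents with a single-valued "term" field.
docs = [
    {"id": 1, "term": "Quick"},
    {"id": 2, "term": "quick"},
    {"id": 3, "term": "brown"},
    {"id": 4, "term": "QUICK"},
]

# Copy of the field with different processing: lowercased, used only for matching.
term_lower = {doc["id"]: doc["term"].lower() for doc in docs}

def filter_ids(value):
    """Case-insensitive filter, loosely analogous to fq=term_lower:<value>."""
    wanted = value.lower()
    return {doc_id for doc_id, term in term_lower.items() if term == wanted}

def facet_counts(doc_ids):
    """Count values of the untouched field so the original casing is returned."""
    return Counter(doc["term"] for doc in docs if doc["id"] in doc_ids)

matches = filter_ids("quick")
print(sorted(matches))
print(facet_counts(matches))
```

The filter selects documents 1, 2 and 4 regardless of casing, and the facet
counts still distinguish "Quick", "quick" and "QUICK".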

As to the different variants of the same token, some of the filters
have a preserveOriginal flag that will generate two forms. For example,
WordDelimiterFilterFactory:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilterFactory.html

There is also http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html
but it is not clear which subsequent filters actually take advantage of
this duplication.

And, of course, ngram filters generate multiple token substrings, all
in the same positions. Easy to see by using an analyzer chain that has
one and testing it in the Admin UI's Analyze screen with extended
information checkbox enabled.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On 25 November 2014 at 08:05, Apurv Verma <da...@gmail.com> wrote:
> Hey Michael,
>  Thanks for your reply. My use case is a little different. I would like to
> get the original values in facet queries but I would like to apply filter
> queries in a case insensitive fashion.
>
> For example  I require facet_query to return Quick, The, brown, ...
> But I want filter queries of the form fq=Term:"quick"
>
> Also could you please point me to some additional links on how I can index
> different variants of a token at the same position?
>
>
> --
> Regards,
> Apurv Verma

Re: Case Insensitive Matching in Solr/Lucene

Posted by Apurv Verma <da...@gmail.com>.

Hey Michael,
 Thanks for your reply. My use case is a little different. I would like to
get the original values in facet queries but I would like to apply filter
queries in a case insensitive fashion.

For example, I require facet_query to return Quick, The, brown, ...
But I want filter queries of the form fq=Term:"quick"

Also could you please point me to some additional links on how I can index
different variants of a token at the same position?


--
Regards,
Apurv Verma



On Tue, Nov 25, 2014 at 6:26 PM, Michael Sokolov <
msokolov@safaribooksonline.com> wrote:

> right -- missed Ahmet's answer there in my haste to respond ...
>
> -Mike

Re: Case Insensitive Matching in Solr/Lucene

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

right -- missed Ahmet's answer there in my haste to respond ...

-Mike

On 11/25/14 6:56 AM, Ahmet Arslan wrote:
> Hi Apurv,
>
> I wouldn't worry about index size; the increase in index size is not linear (2x) like that.
> Please see similar discussion :
> https://issues.apache.org/jira/browse/LUCENE-5620
>
> Ahmet

Re: Case Insensitive Matching in Solr/Lucene

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Apurv,

I wouldn't worry about index size; the increase in index size is not linear (2x) like that.
Please see a similar discussion:
https://issues.apache.org/jira/browse/LUCENE-5620

Ahmet


On Tuesday, November 25, 2014 1:46 PM, Ahmet Arslan <io...@yahoo.com.INVALID> wrote:



Hi Apurv,

You can create an additional field for case sensitive search, and then you can switch at query time. You will have two fields (text_ci and text_lower) with different analysers populated with copyField.

Ahmet



Re: Case Insensitive Matching in Solr/Lucene

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

The index size will not increase as quickly as you might think, and is 
not an issue in most cases.  An alternative to two fields, though, is to 
index both upper- and lower-case tokens at the same position in a single 
field, and then to perform no case folding at query time.  There is no 
standard analysis component that does this, but see LUCENE-5620 for more 
discussion; the ticket describes a component that will get you there.

-Mike
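
As a rough model of the single-field idea above, the Python sketch below
emits the original token and, when it differs, its lowercased form at the
same position. It is not the component from LUCENE-5620 and not Lucene code;
analyze_keep_case, build_positional_index and lookup are hypothetical names
and the two documents are made up. Because nothing is folded at query time,
an all-lowercase term matches any casing, while a term typed with capitals
only matches that exact form.

```python
from collections import defaultdict

def analyze_keep_case(text):
    """Yield (term, position) pairs; a token that changes when lowercased
    is emitted twice at the same position (original and lowercased form)."""
    for position, token in enumerate(text.split()):
        yield token, position
        lowered = token.lower()
        if lowered != token:
            yield lowered, position

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for term, position in analyze_keep_case(text):
            index[term][doc_id].append(position)
    return index

docs = {1: "The quick brown fox", 2: "Quick brown foxes"}
index = build_positional_index(docs)

def lookup(term):
    """No case folding here: the term is looked up exactly as typed."""
    return set(index.get(term, {}))

print(sorted(lookup("quick")))  # lowercase term: any casing matches
print(sorted(lookup("Quick")))  # capitalized term: exact form only
```

Only tokens containing uppercase characters are stored twice, which is one
reason the extra cost of this layout depends on the text rather than being a
fixed factor.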

On 11/25/14 6:52 AM, Apurv Verma wrote:
> Hi Ahmet,
>   Thanks for your reply. Creating two separate fields is a viable solution,
> where one contains the original value and the other contains the lowercased
> value. But this leads to index bloat (~2x).
> I am looking for alternative solutions.
>
>
> --
> Regards,
> Apurv Verma

Re: Case Insensitive Matching in Solr/Lucene

Posted by Apurv Verma <da...@gmail.com>.

Hi Ahmet,
 Thanks for your reply. Creating two separate fields is a viable solution,
where one contains the original value and the other contains the lowercased
value. But this leads to index bloat (~2x).
I am looking for alternative solutions.


--
Regards,
Apurv Verma



On Tue, Nov 25, 2014 at 5:15 PM, Ahmet Arslan <io...@yahoo.com.invalid>
wrote:

> Hi Apurv,
>
> You can create an additional field for case sensitive search, and then you
> can switch at query time. You will have two fields (text_ci and text_lower)
> with different analysers populated with copyField.
>
> Ahmet

Re: Case Insensitive Matching in Solr/Lucene

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Apurv,

You can create an additional field for case sensitive search, and then you can switch at query time. You will have two fields (text_ci and text_lower) with different analysers populated with copyField.

Ahmet
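
A minimal Python sketch of the two-field setup described above, under stated
assumptions: the field names text_exact and text_lower, the two sample
documents, and the helpers are hypothetical, and plain dictionaries stand in
for Solr fields. Both fields are filled from the same source text, which is
what copyField would do, and the caller picks the field at query time.

```python
from collections import defaultdict

def index_field(docs, analyze):
    """Build term -> set of doc ids for one field, using the given analyzer."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return postings

def exact_analyzer(text):
    """Whitespace tokenization only; casing is kept."""
    return text.split()

def lowercase_analyzer(text):
    """Whitespace tokenization followed by lowercasing."""
    return [token.lower() for token in text.split()]

docs = {1: "The quick brown fox", 2: "Quick brown foxes"}

# Both fields are populated from the same source text.
fields = {
    "text_exact": (index_field(docs, exact_analyzer), exact_analyzer),
    "text_lower": (index_field(docs, lowercase_analyzer), lowercase_analyzer),
}

def search(field, query):
    """Analyze the query with the field's own analyzer, then AND the terms."""
    postings, analyze = fields[field]
    results = [postings.get(term, set()) for term in analyze(query)]
    return set.intersection(*results) if results else set()

print(sorted(search("text_exact", "Quick")))  # case-sensitive field
print(sorted(search("text_lower", "Quick")))  # case-insensitive field
```

The choice between case-sensitive and case-insensitive matching is then just
a choice of field name in the query.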


On Tuesday, November 25, 2014 1:39 PM, Apurv Verma <ap...@bloomreach.com> wrote:
Hey all,
The standard solution to doing a case-insensitive match in lucene is to
use a Lowercase filter at index and query time. However this does not
preserve the content of the original document. For example if my inverted
index is.

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

Is it possible to choose between case insensitive/ case sensitive match at
query time. The index is stored in memory in solr. My question is, if this
is stored as a hashmap with string key can I override the hashcode so that
"Quick" and "quick" return the same hash value?

Has anyone attempted this before? Is my assumption about index right? What
would be the classes and code flow to look at?

-- 
Regards,
Apurv

Re: Case Insensitive Matching in Solr/Lucene

Posted by "Heyde, Ralf" <ra...@zalando.de>.

Simply use two fields: one for case-sensitive and one for case-insensitive matching.
On 25.11.2014 at 12:39, "Apurv Verma" <ap...@bloomreach.com> wrote:

> Hey all,
>  The standard solution to doing a case-insensitive match in lucene is to
> use a Lowercase filter at index and query time. However this does not
> preserve the content of the original document. For example if my inverted
> index is.
>
> Term      Doc_1  Doc_2
> -------------------------
> Quick   |       |  X
> The     |   X   |
> brown   |   X   |  X
> dog     |   X   |
> dogs    |       |  X
> fox     |   X   |
> foxes   |       |  X
> in      |       |  X
> jumped  |   X   |
> lazy    |   X   |  X
> leap    |       |  X
> over    |   X   |  X
> quick   |   X   |
> summer  |       |  X
> the     |   X   |
> ------------------------
>
> Is it possible to choose between case insensitive/ case sensitive match at
> query time. The index is stored in memory in solr. My question is, if this
> is stored as a hashmap with string key can I override the hashcode so that
> "Quick" and "quick" return the same hash value?
>
> Has anyone attempted this before? Is my assumption about index right? What
> would be the classes and code flow to look at?
>
> --
> Regards,
> Apurv
>