Posted to solr-user@lucene.apache.org by hermida <le...@gmail.com> on 2009/12/04 19:22:25 UTC

how to do auto-suggest case-insensitive match and return original case field values

Hi everyone,

New to forum and to Solr, doing my first major project with it and enjoying
it so far, great software.

In my web application I want to set up auto-suggest as you type
functionality which will search case-insensitively yet return the original
case terms.  It doesn't seem like TermsComponent can do this as it can only
return the lowercase indexed terms you are searching against, not the
original case terms.

There was one post on this forum
http://old.nabble.com/Auto-suggest..-how-to-do-mixed-case-td24106666.html#a24143981
where someone asked the same question, and the reply was:

There is no way to do this right now using TermsComponent. You can index
lower case terms and store the mixed case terms. Then you can use a prefix
query which will return documents (and hence stored field values).

So this got me started, I set out to use Solr Query instead of
TermsComponent to try to do this.  I did the following as mentioned:

<fieldType name="test" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="test_lc" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="test" type="test" indexed="false" stored="true"
multiValued="true" />
<field name="test_lc" type="test_lc" indexed="true"  stored="false"
multiValued="true" />

And used copyField to populate the test_lc field:

<copyField source="test" dest="test_lc"/>

This is the easy part (the forum user didn't explain the hard part!).  It is
very hard to get the same information that TermsComponent returns using the
regular Solr Query functionality!  For example:

http://localhost:8983/solr/terms?terms.fl=test_lc&terms.prefix=a&terms.sort=count&terms.limit=5&omitHeader=true

<int name="a-kinase anchor protein 13">15</int>
<int name="accn5">6</int>
<int name="actin-binding">3</int>
<int name="activator">1</int>
<int name="agie-bp1">1</int>

which sorts by and returns the term frequency counts in your index.  How
does one get this same information with a regular Solr query?
I set up the following prefix query, searching by the indexed lowercased
field and returning the other:

http://localhost:8983/solr/select?fl=test&q=test_lc%3Aa*&sort=score+desc&rows=5&omitHeader=true

<doc>
  <arr name="test">
    <str>3D-structure</str>
    <str>acetylation</str>
    <str>alternative promoter usage</str>
    <str>HLC-7</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>alternative splicing</str>
    <str>complete proteome</str>
    <str>DNA-binding</str>
    <str>RACK1</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>acetylation</str>
    <str>AIG21</str>
    <str>WD repeat</str>
    <str>GNB2L1</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>3D-structure</str>
    <str>apoptosis</str>
    <str>cathepsin G-like 1</str>
    <str>ATSGL1</str>
    <str>CTLA-1</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>autoantigen Ge-1</str>
    <str>autoantigen RCD-8</str>
    <str>HERV-H LTR-associating protein 3</str>
    <str>HHLA3</str>
  </arr>
</doc>

I can see how to process this in my front-end app to extract the original
terms starting with the prefix letter(s) used in the query, but there are
still some major problems when compared to TermsComponent:

- How do I make sure my auto-suggest list is at least a certain number of
terms long?  Using rows of course doesn't work like terms.limit, because
the same term can appear in several returned docs and these duplicates get
collapsed.
- How do I get term frequency counts like TermsComponent does?  I looked at
faceting but I don't understand how to get the TermsComponent behavior using
it.
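
A minimal sketch of the front-end processing described above, in Python and
under assumptions: the docs are the stored "test" values from the response
shown earlier, already parsed into lists, and suggest_from_docs is a
hypothetical helper name.  It counts each matching term once per document, so
the counts cover only the rows that were returned, not the whole index the way
TermsComponent counts do.

```python
from collections import Counter


def suggest_from_docs(docs, prefix, limit=5):
    """Collect original-case terms matching `prefix` case-insensitively.

    `docs` is a list of documents, each a list of stored "test" values.
    Returns (term, doc_count) pairs, most frequent first.
    """
    needle = prefix.lower()
    counts = Counter()
    for values in docs:
        # Count each term once per document (a document frequency).
        for term in set(values):
            if term.lower().startswith(needle):
                counts[term] += 1
    # Highest count first, then alphabetical (ignoring case) for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0].lower()))
    return ranked[:limit]


# Stored values copied from the five docs in the response above.
docs = [
    ["3D-structure", "acetylation", "alternative promoter usage", "HLC-7"],
    ["alternative splicing", "complete proteome", "DNA-binding", "RACK1"],
    ["acetylation", "AIG21", "WD repeat", "GNB2L1"],
    ["3D-structure", "apoptosis", "cathepsin G-like 1", "ATSGL1", "CTLA-1"],
    ["autoantigen Ge-1", "autoantigen RCD-8",
     "HERV-H LTR-associating protein 3", "HHLA3"],
]

for term, count in suggest_from_docs(docs, "a"):
    print(count, term)
```

With rows=5 this finds eight distinct matching terms, but that number depends
entirely on which docs happen to come back, which is the first problem listed
above.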

Sorry for the long message, just wanted to fully explain, thanks for any
help!

leandro

-- 
View this message in context: http://old.nabble.com/how-to-do-auto-suggest-case-insensitive-match-and-return-original-case-field-values-tp26636365p26636365.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to do auto-suggest case-insensitive match and return original case field values

Posted by hermida <le...@gmail.com>.
Hello,

Thanks for the reply (see below)


hossman wrote:
> 
> The type of approach you are describing (doing a prefix based query for 
> autosuggest) probably won't work very well unless your index is 100% 
> designed just for the autosuggest ... if it's an index about products, and 
> you're just using one of the fields for autosuggest, you aren't going to 
> get good autosuggest results because the same word is going to appear in 
> multiple products.  what you need is an index of *words* that you want to 
> autosuggest, with fields indicating how important those words are that you 
> can use in a function query (this replaces the term freq that 
> TermsComponent would use)
> 
> the fact that your "test" field is multivalued and stores widely different 
> things in each doc is an example of what i mean.
> 

I am using Solr to index biological annotations about proteins (which are my
documents). There is no tokenization or special analysis of the annotation
text strings as they are not free text, each annotation is a single token. 
Also, for the purpose of my auto-suggest and searching there are actually no
different types of annotations, that's why they all go into the same
multivalued field for each protein document.  I want to use the auto-suggest
and search to help biologists (who know the annotation terminology) find all
the protein documents with the annotation they are thinking of, and to
suggest what is available as they type.  The thing is that in my field
letter case can be important in defining the meaning of an annotation, but the
biologist might not remember the exact case.  Therefore I want them to be
able to type in whatever case and have the auto-suggest pull up, as they
type, annotations with the correct case to assist them.

Let's just take the fundamental question, independent of any example:  is it
possible to do a case-insensitive prefix search using faceting (to get the
term suggestions) that also returns the originally mixed case terms of *all*
those terms listed in lowercase in the facet list?  In the only other post I
saw in this forum on this topic, a user seemed to think this was easily
doable, but I don't think they actually tried it, because the faceted
search doesn't seem possible: you run into all these problems.  It just
isn't something Solr/Lucene can actually do the way it is organized.
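
For reference, a faceting request for this would look roughly like the sketch
below (Python; facet_prefix_url is a hypothetical helper, and the /select URL
and test_lc field are the ones from the original message in this thread).
facet.prefix and facet.limit play the roles of terms.prefix and terms.limit,
but facet values are the indexed terms, so against test_lc they come back
lowercased, which is exactly the limitation in question.

```python
from urllib.parse import urlencode


def facet_prefix_url(base_url, field, user_input, limit=5):
    """Build a facet request that lists indexed terms starting with a prefix."""
    params = [
        ("q", "*:*"),        # match all documents
        ("rows", 0),         # only the facet counts are wanted, no docs
        ("facet", "true"),
        ("facet.field", field),
        # The field is lowercased at index time, so lowercase the prefix too.
        ("facet.prefix", user_input.lower()),
        ("facet.limit", limit),
        ("facet.mincount", 1),
        ("omitHeader", "true"),
    ]
    return base_url + "?" + urlencode(params)


print(facet_prefix_url("http://localhost:8983/solr/select", "test_lc", "A"))
```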


hossman wrote:
> 
> Have you considered the possibility of just indexing the lowercase value 
> concatenated with the regular case value using a special delimiter, and 
> then returning to your TermsComponent based solution?  index "PowerPoint" 
> as "powerpoint|PowerPoint" and just split on the "|" character when you 
> get the data back from your prefix based term lookup.
> 

I think this is a good workaround, will definitely try it!
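
A rough sketch of how that workaround could look on the client side, in
Python.  The helper names are hypothetical, and it assumes the "|" delimiter
never occurs inside an annotation; pick another delimiter if it does.

```python
DELIMITER = "|"


def to_indexed_value(term):
    """Build the value to index: lowercase key, delimiter, original term."""
    if DELIMITER in term:
        raise ValueError("term contains the delimiter: %r" % term)
    return term.lower() + DELIMITER + term


def from_indexed_value(value):
    """Recover the original-case term from an indexed value."""
    # Everything after the first delimiter is the original term.
    _, _, original = value.partition(DELIMITER)
    return original


def parse_suggestions(term_counts):
    """Map (indexed_value, count) pairs from a terms lookup to original case."""
    return [(from_indexed_value(value), count) for value, count in term_counts]


print(to_indexed_value("PowerPoint"))   # value to send to the index
print(parse_suggestions([("powerpoint|PowerPoint", 3), ("rack1|RACK1", 1)]))
```

The prefix sent as terms.prefix has to be lowercased as well (for example
user_input.lower()), since the lowercase key comes first in each indexed
value.  Two case variants of the same annotation become two separate terms,
each with its own count.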

leandro

-- 
View this message in context: http://old.nabble.com/how-to-do-auto-suggest-case-insensitive-match-and-return-original-case-field-values-tp26636365p26701111.html


Re: how to do auto-suggest case-insensitive match and return original case field values

Posted by hermida <le...@gmail.com>.
Hello,

Watched the JIRA issue and saw that it got committed recently.  Just tested
it and it works *perfectly*, thanks Uri for adding such a nice feature to Solr!

For other users out there who want to do this:

1. Download the latest nightly build of Solr 1.5-dev at
http://people.apache.org/builds/lucene/solr/nightly/
2. For the index field you were using to do terms auto-suggest, rebuild it
without using LowerCaseFilterFactory so that it indexes the original mixed
case terms
3. In your terms HTTP GET URL, replace terms.prefix=abc (where abc is
actually what the user is typing in) with

terms.regex=%5Eabc.%2A&terms.regex.flag=case_insensitive

where %5E = ^ and %2A = *

Voila!
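
A small sketch of building that URL from what the user typed, in Python.
terms_regex_url is a hypothetical helper and the field name "test" is only an
assumption taken from the schema earlier in this thread.  It also escapes
regex metacharacters in the input with Python's re.escape, on the assumption
that backslash-escaped punctuation is read the same way by the regex engine on
the Solr side; that is worth checking against your own data.

```python
import re
from urllib.parse import urlencode


def terms_regex_url(base_url, field, user_input, limit=5):
    """Build a TermsComponent URL for a case-insensitive prefix match."""
    # Anchor at the start and allow anything after the typed text.
    # re.escape keeps input such as "a-kinase (" from being read as regex syntax.
    pattern = "^" + re.escape(user_input) + ".*"
    params = [
        ("terms.fl", field),
        ("terms.regex", pattern),
        ("terms.regex.flag", "case_insensitive"),
        ("terms.sort", "count"),
        ("terms.limit", limit),
        ("omitHeader", "true"),
    ]
    # urlencode percent-encodes ^ as %5E and * as %2A.
    return base_url + "?" + urlencode(params)


print(terms_regex_url("http://localhost:8983/solr/terms", "test", "abc"))
```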


hermida wrote:
> 
> 
> Uri Boness wrote:
>> 
>> Just updated SOLR-1625 to support regexp hints.
>> 
>> https://issues.apache.org/jira/browse/SOLR-1625
>> 
>> Cheers,
>> Uri
>> 
> 
> This is perfect, exactly what is needed to make this functionality
> possible.  Is the patch already in trunk? 
> 
> thanks,
> leandro
> 

-- 
View this message in context: http://old.nabble.com/how-to-do-auto-suggest-w--case-insensitive-search-and-suggesting-original-mixed-case-field-values-tp26636365p26775120.html


Re: how to do auto-suggest case-insensitive match and return original case field values

Posted by hermida <le...@gmail.com>.

Uri Boness wrote:
> 
> Just updated SOLR-1625 to support regexp hints.
> 
> https://issues.apache.org/jira/browse/SOLR-1625
> 
> Cheers,
> Uri
> 

This is perfect, exactly what is needed to make this functionality possible. 
Is the patch already in trunk? 

thanks,
leandro
-- 
View this message in context: http://old.nabble.com/how-to-do-auto-suggest-w--case-insensitive-search-and-suggesting-original-mixed-case-field-values-tp26636365p26706241.html


Re: how to do auto-suggest case-insensitive match and return original case field values

Posted by Uri Boness <ub...@gmail.com>.
Just updated SOLR-1625 to support regexp hints.

https://issues.apache.org/jira/browse/SOLR-1625

Cheers,
Uri


Re: how to do auto-suggest case-insensitive match and return original case field values

Posted by Chris Hostetter <ho...@fucit.org>.
: In my web application I want to set up auto-suggest as you type
: functionality which will search case-insensitively yet return the original
: case terms.  It doesn't seem like TermsComponent can do this as it can only
: return the lowercase indexed terms you are searching against, not the
	...
: which provides useful sorting by and returning of term frequency counts in
: your index.  How does one get this same information with regular Solr Query? 
: I set up the following prefix query, searching by the indexed lowercased
: field and returning the other:

The type of approach you are describing (doing a prefix based query for 
autosuggest) probably won't work very well unless your index is 100% 
designed just for the autosuggest ... if it's an index about products, and 
you're just using one of the fields for autosuggest, you aren't going to 
get good autosuggest results because the same word is going to appear in 
multiple products.  what you need is an index of *words* that you want to 
autosuggest, with fields indicating how important those words are that you 
can use in a function query (this replaces the term freq that 
TermsComponent would use)

the fact that your "test" field is multivalued and stores widely different 
things in each doc is an example of what i mean.

Have you considered the possibility of just indexing the lowercase value 
concatenated with the regular case value using a special delimiter, and 
then returning to your TermsComponent based solution?  index "PowerPoint" 
as "powerpoint|PowerPoint" and just split on the "|" character when you 
get the data back from your prefix based term lookup.


-Hoss


Re: how to do auto-suggest case-insensitive match and return original case field values

Posted by hermida <le...@gmail.com>.
Hi again,

Just pinging again to any Solr experts out there... sorry that my previous
message was a bit long (wanted to fully explain what I've already done and
where the exact difficulty arises)... but to summarize:

Does anyone know how to use Solr querying with faceting to do an
auto-suggest that searches case-insensitively yet returns the original mixed
case values???

thanks for any help,
Leandro

-- 
View this message in context: http://old.nabble.com/how-to-do-auto-suggest-case-insensitive-match-and-return-original-case-field-values-tp26636365p26694224.html