Posted to solr-user@lucene.apache.org by Wayne W <wa...@gmail.com> on 2014/10/01 13:16:37 UTC

Wildcard search makes no sense!!

Hi,

I don't understand this at all. We are indexing some contact names. When we
do a standard query:

query 1: capi*
result: Capital Health

query 2: capit*
result: Capital Health

query 3: capita*
result: <no results>

query 4: capital*
result: <no results>

I understand (as we are using Solr 3.5) that a wildcard search does not
actually return matches for the term without the wildcard, so I understand at
least why query 4 is not working (I need to use: capital* OR capital). What I
don't understand is why query 3 is not working.

Also if we place in the text field the following 3 contacts:

jo@capitalhealth.com
fred@capitalhealth.com
Capital Health

When searching for:

query A: capita*
result: jo@capitalhealth.com, fred@capitalhealth.com

query B: capit*
result: jo@capitalhealth.com, fred@capitalhealth.com, Capital Health


What is going on, and how can I solve this?
Many thanks, as I'm really stuck on this.

Re: Wildcard search makes no sense!!

Posted by waynemailinglist <wa...@gmail.com>.
OK, I think I understand your points there. Just to clarify: say the term was
"Large increased" and my filters produced something like:

Large|increased
Large|increase|increased
large|increase|increased

Would the final tokens indexed be large|increase|increased?

Once again thanks for all the help.


--
View this message in context: http://lucene.472066.n3.nabble.com/Wildcard-search-makes-no-sense-tp4162069p4162349.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Wildcard search makes no sense!!

Posted by Erick Erickson <er...@gmail.com>.
Right, prior to 3.6 the standard way to handle wildcards was,
essentially, to pre-analyze the terms that had wildcards yourself. This
works fine for simple filters, things like lowercasing for instance, but
doesn't work so well for things like stemming.

So you're doing what can be done at this point, but moving to 4.x (or
even 3.6) would solve it better.

Best,
Erick

On Thu, Oct 2, 2014 at 6:29 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> The index has two terms for this field if this is the whole input --
> hello and you -- which can be searched for individually.  The tokenizer
> does the initial job of separating the input into tokens (terms) ...
> some filters can create additional terms, depending on exactly what's
> left when the tokenizer is done.
>
> Thanks,
> Shawn
>

Re: Wildcard search makes no sense!!

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/2/2014 4:33 AM, waynemailinglist wrote:
> Something that is still not clear in my mind is how this tokenising works.
> For example with the filters I have when I run the analyser I get:
> Field: Hello You
> 
> Hello|You
> Hello|You
> Hello|You
> hello|you
> hello|you
> 
> 
> Does this mean that the index is stored as 'hello|you' (the final one) and
> that when I run a query and it goes through the filters whatever the end
> result of that is must match the 'hello|you' in order to return a result?

The index has two terms for this field if this is the whole input --
hello and you -- which can be searched for individually.  The tokenizer
does the initial job of separating the input into tokens (terms) ...
some filters can create additional terms, depending on exactly what's
left when the tokenizer is done.

Thanks,
Shawn
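
The point above, that the index holds individual terms rather than one
'hello|you' string, can be illustrated with a small sketch. This is not how
Solr is implemented; analyze() and build_index() are hypothetical names, and
the chain is assumed to be only a whitespace tokenizer plus a lowercase
filter.

```python
from collections import defaultdict


def analyze(text):
    """Toy analysis chain: whitespace tokenizer, then a lowercase filter."""
    tokens = text.split()                # tokenizer: one token per word
    return [t.lower() for t in tokens]   # filter: lowercase each token


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


index = build_index({1: "Hello You"})
print(sorted(index))   # ['hello', 'you']
print(index["you"])    # {1}
```

Each term is its own entry, so a query only has to produce one of them
(for example "you") to find the document.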


Re: Wildcard search makes no sense!!

Posted by waynemailinglist <wa...@gmail.com>.
Many many thanks for the replies - it was helpful for me to start
understanding how this works.

I'm using 3.5, so this explains a lot. What I have done is this: if the
query contains a *, I make the query lowercase before sending it to Solr.
This seems to have solved the issue, given your explanation above. Many thanks.
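
A client-side workaround of this kind might look roughly like the sketch
below. It is an assumption-laden variation, not the code actually used: the
function name is hypothetical, and it lowercases only the terms that contain
a wildcard so that operators such as OR keep their case.

```python
def lowercase_wildcard_terms(query):
    """Lowercase only the whitespace-separated terms that contain * or ?."""
    parts = []
    for part in query.split():
        if "*" in part or "?" in part:
            parts.append(part.lower())   # wildcard term: mimic the lowercase filter
        else:
            parts.append(part)           # leave operators such as OR / AND untouched
    return " ".join(parts)


print(lowercase_wildcard_terms("Capi* OR Capital"))   # capi* OR Capital
print(lowercase_wildcard_terms("Κώστα*"))             # κώστα*
```

Note that a fielded term such as Title:Capi* would have its field name
lowercased too, so a real version would need to handle that case.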

Something that is still not clear in my mind is how this tokenising works.
For example with the filters I have when I run the analyser I get:
Field: Hello You

Hello|You
Hello|You
Hello|You
hello|you
hello|you


Does this mean that the index is stored as 'hello|you' (the final one) and
that when I run a query and it goes through the filters whatever the end
result of that is must match the 'hello|you' in order to return a result?







Re: Wildcard search makes no sense!!

Posted by Erick Erickson <er...@gmail.com>.
Two things:

1> what version of Solr are you using? If it's prior to 3.6, then the
bits that apply LowerCaseFilter to wildcard terms aren't in the
code.

2> what do you see if you add &debug=query?

I just tried it with your analysis chain and it seemed to work. Did
you completely blow your index away when trying this? I did get into a
state where my terms didn't show up. When you change the schema,
sometimes some information about the fields is written into the index
and is incompatible with later changes.

By "completely blow away" I mean
stop Solr
rm -rf blah/collection/data
start Solr
reindex
test


Best,
Erick

On Wed, Oct 1, 2014 at 10:10 AM, waynemailinglist
<wa...@gmail.com> wrote:
> I'm still stuck on this actually. I would really appreciate any pointers.

Re: Wildcard search makes no sense!!

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
If you use "*", you go through the multiterm analysis path, which is
semi-hidden and a lot more limited than what is done with normal tokens:
https://wiki.apache.org/solr/MultitermQueryAnalysis

Analyzer components that are NOT multiterm aware cannot be used
that way. Looking at http://www.solr-start.com/info/analyzers/ , you
can see that, of the filters in your chain, only the LowerCase filter is
multiterm aware (marked with "(multi)" in the brackets). So the rest are
not used.

You may switch to EdgeNGrams or similar instead.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
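
The edge n-gram idea mentioned above can be sketched as follows. This only
illustrates the principle and is not Solr's filter; edge_ngrams() and its
min_gram / max_gram parameters are hypothetical names chosen for the example.
The leading substrings are stored at index time, so a prefix search becomes
an ordinary term lookup and no wildcard is needed.

```python
def edge_ngrams(term, min_gram=1, max_gram=15):
    """Return the leading substrings of term with lengths min_gram..max_gram."""
    upper = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, upper + 1)]


indexed = set(edge_ngrams("capital", min_gram=2))
print(sorted(indexed))       # ['ca', 'cap', 'capi', 'capit', 'capita', 'capital']
print("capita" in indexed)   # True: the prefix is itself an indexed term
```

The trade-off is a larger index, since every prefix of every term is stored.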



Re: Wildcard search makes no sense!!

Posted by waynemailinglist <wa...@gmail.com>.
I'm still stuck on this, actually. I would really appreciate any pointers.
If I search for:
query 1: Κώστας
result: Κώστας

query 2: Κώστα*
result: <no result>

I've looked at the analyser but I don't really understand what I'm looking
at if I'm honest. It gives the output:
Field (name): title
Field value: Κώστας
Field value (query): Κώστα*

Index Analyzer
Κώστας
Κώστας
Κώστας
κώστας
κώστας
Query Analyzer
Κώστα*
Κώστα*
Κώστα*
Κώστα
κώστα
κώστα


In my schema I have defined:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- the synonym filter is only used in the query analyzer -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>


I tried adding ASCIIFoldingFilterFactory, but that didn't make any difference
after reindexing.

Any ideas?

many thanks




Re: Wildcard search makes no sense!!

Posted by waynemailinglist <wa...@gmail.com>.
Ahmet - many thanks. I removed the EnglishPorterFilterFactory and reindexed,
and this seems to behave as expected now.

Jack - thanks as well. I'm very much a noob with this, and that's a great
tip.




Re: Wildcard search makes no sense!!

Posted by Jack Krupansky <ja...@basetechnology.com>.
The presence of a wildcard in a query term short circuits some portions of 
the analysis process. Some token filters like lower case can still be 
performed on the query terms, but others, like stemming, cannot. So, either 
simplify the analysis (be more selective of what token filters you use), or 
you will have to modify your query terms so that you manually simulate the 
token transformations that your text analysis is performing.

Take one of your indexed terms that you think should match and send it 
through the Solr Admin UI analysis page for the query field and see what the 
source token gets analyzed into - that's what your wildcard prefix must 
match. Sometimes (usually!) you will be surprised.

-- Jack Krupansky



Re: Wildcard search makes no sense!!

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2014-10-01 at 13:16 +0200, Wayne W wrote:
> query 2: capit*
> result: Capital Health
> 
> query 3: capita*
> result: <no results>

You are likely using a stemmer for the field: "Capital Health" gets
indexed as "capit" and "health", so there are no tokens starting with
"capita".

Turn off the stemmer or add a non-stemmed copy field for truncated
searches.


(sanity-checked at http://9ol.es/porter_js_demo.html)


- Toke Eskildsen, State and University Library, Denmark
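
The effect described above can be reproduced with a toy model. This is a
sketch, not a real stemmer: TOY_STEMS is a hard-coded stand-in that contains
only the "capital" -> "capit" mapping reported in this message, and the
function names are hypothetical.

```python
# Stand-in for a stemmer; only the mapping reported in the thread is included.
TOY_STEMS = {"capital": "capit"}


def analyze(text):
    """Lowercase, split on whitespace, then apply the toy stemmer."""
    tokens = text.lower().split()
    return [TOY_STEMS.get(t, t) for t in tokens]


def prefix_matches(prefix, indexed_terms):
    """Return the indexed terms that start with the given prefix."""
    return sorted(t for t in indexed_terms if t.startswith(prefix))


terms = set(analyze("Capital Health"))   # {'capit', 'health'}
for q in ["capi", "capit", "capita", "capital"]:
    print(q + "*", prefix_matches(q, terms))
# capi* ['capit']
# capit* ['capit']
# capita* []
# capital* []
```

Under these assumptions the last two prefixes are longer than the stemmed
term, which is consistent with queries 3 and 4 returning nothing.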



Re: Wildcard search makes no sense!!

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,

You probably have a stemmer, and it is reducing "Capital" to "capit". That's the reason.
Either remove the stemmer from the analyzer chain or add a keyword repeat filter.

Ahmet
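
The keyword repeat idea can be sketched like this: each token is kept in
its original form as well as its stemmed form, so prefixes of either one can
match. This is a toy model with hypothetical names and a hard-coded stand-in
stemmer, not the Solr filter itself, and whether such a filter ships with a
given Solr version should be checked.

```python
# Stand-in for a stemmer; only the mapping reported in the thread is included.
TOY_STEMS = {"capital": "capit"}


def analyze_with_keyword_repeat(text):
    """Emit each lowercased token unstemmed and stemmed, dropping duplicates."""
    terms = []
    for token in text.lower().split():
        for candidate in (token, TOY_STEMS.get(token, token)):
            if candidate not in terms:
                terms.append(candidate)
    return terms


terms = analyze_with_keyword_repeat("Capital Health")
print(terms)                                          # ['capital', 'capit', 'health']
print(any(t.startswith("capita") for t in terms))     # True
```

The index grows somewhat, but stemmed matching and prefix matching on the
full word can then coexist in one field.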


