Posted to solr-user@lucene.apache.org by PeterKerk <pe...@hotmail.com> on 2014/10/01 20:52:24 UTC

Re: Flexible search field analyser/tokenizer configuration

Ok, I missed the Query tab where I can do the actual site search :)

I've also used your links, but even with those I fail to grasp why the
following is happening:

This is my query:
http://localhost:8983/solr/bm/select?q=*%3A*&fq=The+Royal+Garden&rows=50&fl=id%2Ctitle&wt=xml&indent=true


And below is the result.
Notice how results that have "the" in their title are also returned. Words
like "the", "a", "in" are in general words I wish to ignore IF the rest of
the title does not match.
And with my query "The Royal Garden", I have a result that is an exact
match on all 3 words, but that result is listed all the way at the bottom.
How can I:

a) make sure that items that only share the words I want to ignore ("the",
"a" etc.) are not returned
b) make sure that the exact match is at the top of the results and only
after that the partial matches, so that the 1st result would be "The Royal
Garden" and the 2nd result would be "Royal"

Thanks!

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="fl">id,title</str>
    <str name="indent">true</str>
    <str name="q">*:*</str>
    <str name="_">1412188632532</str>
    <str name="wt">xml</str>
    <str name="fq">The Royal Garden</str>
    <str name="rows">60</str>
  </lst>
</lst>
<result name="response" numFound="9" start="0">
  <doc>
    <str name="id">1579</str>
    <str name="title">Royal</str></doc>
  <doc>
    <str name="id">1603</str>
    <str name="title">The Blue Lagoon</str></doc>
  <doc>
    <str name="id">1629</str>
    <str name="title">The Nightingale DJ Light Sound Vision</str></doc>
  <doc>
    <str name="id">1648</str>
    <str name="title">The Swingmasters</str></doc>
  <doc>
    <str name="id">2431</str>
    <str name="title">The Cover Band</str></doc>
  <doc>
    <str name="id">2457</str>
    <str name="title">The Teahouse Company</str></doc>
  <doc>
    <str name="id">2493</str>
    <str name="title">The Task - Ultimate Party Band</str></doc>
  <doc>
    <str name="id">2499</str>
    <str name="title">The Royal Garden</str></doc>
  <doc>
    <str name="id">2500</str>
    <str name="title">The Wall</str></doc>
</result>
</response>



--
View this message in context: http://lucene.472066.n3.nabble.com/Flexible-search-field-analyser-tokenizer-configuration-tp4161624p4162174.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Flexible search field analyser/tokenizer configuration

Posted by Erick Erickson <er...@gmail.com>.

Peter:

You're still missing the boat a bit here with your boosts. Boosts
applied to "fq" clauses are _completely and totally useless_. Don't
even bother putting them in, they just confuse me ;).

fq clauses are simple binary decisions and do NOT contribute to
scoring in any way at all. Way under the covers, the query is
evaluated and encoded in a bitset over the internal Lucene doc IDs.
Each doc that matches results in a 1 in the appropriate place. This
bitset is what goes in the filterCache. Which, incidentally, is why
the size of each entry is (some overhead) + maxDoc/8 bytes.

Anyway, the entire result of the calculations is just this bit, so
there's no room for scoring information. Now, putting boosts in the fq
clause doesn't change the results, but if you're expecting them to
have any effect on the query you'll be disappointed.

Best,
Erick

On Sat, Oct 4, 2014 at 9:17 AM, PeterKerk <pe...@hotmail.com> wrote:
> Thanks, removing the fq parameters helped :)

Re: Flexible search field analyser/tokenizer configuration

Posted by PeterKerk <pe...@hotmail.com>.

Thanks, removing the fq parameters helped :)




Re: Flexible search field analyser/tokenizer configuration

Posted by Jack Krupansky <ja...@basetechnology.com>.

Thanks for the clarification. Now... "fq" is simply another query, with 
normal query syntax. You wrote two field names as if they were query terms, 
but that's not meaningful query syntax. Sorry, but there is no such feature 
in Solr.

Although the qf parameter of dismax and edismax can be used to apply a boost 
to all un-fielded terms for a field, you otherwise need to apply any boost 
on a term, not a field.
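
For illustration only, a minimal sketch of how the field-level boosts from the question could be expressed through edismax's qf parameter rather than in fq. The handler name "/search" is hypothetical; the field names and boost values are the ones from the query earlier in the thread.

```xml
<!-- Sketch only: the handler name "/search" is an assumption, not from the thread. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Extended dismax spreads un-fielded query terms over the fields in qf. -->
    <str name="defType">edismax</str>
    <!-- Field-level boosts: a title match weighs far more than a description match. -->
    <str name="qf">title_search_global^10.0 description_search^0.3</str>
  </lst>
</requestHandler>
```

With defaults like these, the request would carry only the user's terms (for example q=Ballonnenboog), and no fq would be involved in the boosting.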

-- Jack Krupansky

-----Original Message----- 
From: PeterKerk
Sent: Saturday, October 4, 2014 10:43 AM
To: solr-user@lucene.apache.org
Subject: Re: Flexible search field analyser/tokenizer configuration


Re: Flexible search field analyser/tokenizer configuration

Posted by PeterKerk <pe...@hotmail.com>.

In English, I think this part:
(title_search_global:(Ballonnenboog) OR
title_search_global:"Ballonnenboog"^100)
is looking for a match on "Ballonnenboog" in the title and gives a boost if it
occurs exactly like this.

The second part does the same but for the description_search field,
with an OR operator (so I would think it would not eliminate all matches):

(description_search:(Ballonnenboog) OR
description_search:"Ballonnenboog"^100)

And finally this part:

title_search_global^10.0+description_search^0.3

gives a higher boost to an occurrence of the query in the title_search_global
field than in the description_search field.

But something must be wrong with my analysis :)




Re: Flexible search field analyser/tokenizer configuration

Posted by Jack Krupansky <ja...@basetechnology.com>.

What exactly do you think that filter query is doing? Explain it in plain 
English.

My guess is that it eliminates all your document matches.

-- Jack Krupansky

-----Original Message----- 
From: PeterKerk
Sent: Saturday, October 4, 2014 12:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Flexible search field analyser/tokenizer configuration


Re: Flexible search field analyser/tokenizer configuration

Posted by PeterKerk <pe...@hotmail.com>.

Ok, that field now totally works, thanks again!

I've removed the wildcard to benefit from ranking and boosting and am now
trying to combine this field with another, but I have some difficulties
figuring out the right query.

I want to search for occurrences of the keyword in the title field
(title_search_global) of a document OR in the description field
(description_search),
and if it occurs in the title field give that the largest boost, over a
minor boost in the description_search field.

Here's what I have now for the query "Ballonnenboog":

http://localhost:8983/solr/tt-shop/select?q=(title_search_global%3A(Ballonnenboog)+OR+title_search_global%3A%22Ballonnenboog%22%5E100)+OR+description_search%3A(Ballonnenboog)&fq=title_search_global%5E10.0%2Bdescription_search%5E0.3&fl=id%2Ctitle&wt=xml&indent=true

But it returns 0 results, even though there are results that have
"Ballonnenboog" in the title_search_global field.

What am I missing?




Re: Flexible search field analyser/tokenizer configuration

Posted by Erick Erickson <er...@gmail.com>.

bq: But with my new query, could I just remove the defType=lucene parameter and
the wildcard right

Well, It Depends (tm). You can specify the query parser as part of
the requestHandler and, indeed, leave it off the query.

As far as the wildcard goes, it also depends. You'll change the semantics
of the search for sure, which may be A Good Thing. Without knowing
_why_ the wildcard is put there it's hard to say.

In general, it's usually better to concentrate on the analysis
chain than make up for faulty analysis with wildcards. For instance,
let's claim you want to search for "run" and match docs that have
"runs", "runner" or "running" in a field. One way of doing this is to
just search for "run*". A better way is to include an aggressive stemmer
in the fieldType that reduces "running", "runs" and "runner" to "run" at
both query and index time. Now, searching for any of "run", "runner",
"running", "runs" will match any of "run", "runs", "running" or "runner".

And wildcards _also_ stop scoring from happening, so with techniques
like the above you also get better relevance scoring.
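
A minimal sketch of the stemming approach described above. The type name "text_stemmed" and the choice of the Porter stemmer are assumptions made for illustration; which forms collapse to a common stem depends on the stemmer picked, and a form like "runner" may need a more aggressive one.

```xml
<!-- Sketch only: the type name "text_stemmed" is hypothetical. -->
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <!-- An <analyzer> without a type attribute is used at both index and query time. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Reduces inflected forms such as "running" and "runs" to a shared stem. -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis page in the admin UI shows which terms a given input actually produces for a field type, which is the quickest way to check what a stemmer does with a particular word.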

Best,
Erick

On Wed, Oct 1, 2014 at 9:01 PM, PeterKerk <pe...@hotmail.com> wrote:
> Sorry, one final thing.

Re: Flexible search field analyser/tokenizer configuration

Posted by PeterKerk <pe...@hotmail.com>.

Sorry, one final thing.

In my current application I search like this:
&q=title:<searchquery>*&defType=lucene

I was checking here: http://wiki.apache.org/solr/SolrQuerySyntax

But with my new query, could I just remove the defType=lucene parameter and
the wildcard? Or am I overlooking something then?

Thanks!




Re: Flexible search field analyser/tokenizer configuration

Posted by PeterKerk <pe...@hotmail.com>.

You were right, I had an old configuration :)
But using your new suggestions made it work! Thanks!




Re: Flexible search field analyser/tokenizer configuration

Posted by Erick Erickson <er...@gmail.com>.

1>
Hmmm, you _should_ have some line like:
<requestHandler name="/select" class="solr.SearchHandler">
in solrconfig.xml, otherwise the url you posted has no
destination.

http://localhost:8983/solr/bm/select
implies that there's a request handler to, well, handle it so I'm
puzzled.

When you _do_ find the request handler, there should be a line like
this:
<str name="df">text</str>

that defines the default text field. It's vaguely possible that you
have something like:
<defaultSearchField>text</defaultSearchField>
in your schema.xml file, which would mean you copied stuff from
an old Solr (pre 3.6) schema file.

If you don't find any of those, please post your solrconfig.xml file
and we can look for it...
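
As a rough sketch of how the two lines above typically fit together in solrconfig.xml (the field name "text" is a placeholder, not taken from the poster's schema):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Field searched when a query term names no field; "text" is a placeholder. -->
    <str name="df">text</str>
  </lst>
</requestHandler>
```

Anything in the "defaults" list can still be overridden per request, for example by adding &df=title to the URL.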

2> I'm getting mixed up by the term "exclude" ;).

If you want to have both Dutch and English stopwords removed, you
can, of course, put them in the same stopwords file.

If you mean that you want to remove different stopwords in different
languages, you need to define two field types and thus two fields.
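
A sketch of the single-field variant. Instead of merging both lists into one file, it chains two stop filters, which removes the same words; the file name "stopwords_english.txt" is an assumption, while "stopwords_dutch.txt" is the file from the question.

```xml
<!-- Sketch only: "stopwords_english.txt" is an assumed file name. -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Each stop filter drops the words listed in its own file. -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_english.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

The query analyzer would carry the same two stop filters, so that stopwords are dropped consistently on both sides.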

3a> stopwords at both index and query time will do this.

3b> First, you need to stop using the fq parameter and move it to the
q parameter, as
q=title:(the royal garden)

That should move things more like you wish. Using boosts can also help.
But, don't get too hung up on exact ordering. For small numbers of documents
and short fields, the tf/idf calculations can lose the ranking because of
essentially rounding errors.

You can also use a boost query that is a phrase. By that I mean
adding something like
OR title:"The Royal Garden"^100

That would tend to force anything with the exact sequence of words
"The Royal Garden" way up in the list. If stopwords are removed
in your fieldType, this is equivalent to "Royal Garden"^100, so you don't
have to do anything special.

edismax has a way to do this via configuration.
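
The configuration meant here is presumably edismax's pf (phrase fields) parameter; that reading is an assumption. A minimal sketch, using the "title" field from the query above:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Fields searched for the individual query terms. -->
    <str name="qf">title</str>
    <!-- Phrase fields: extra boost when the query terms occur together as a phrase. -->
    <str name="pf">title^100</str>
  </lst>
</requestHandler>
```

With this in place, a plain q=The Royal Garden should score a title containing the whole phrase well above titles that match only one of the words, without writing the phrase clause by hand.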

Hope this helps,
Erick

On Wed, Oct 1, 2014 at 1:32 PM, PeterKerk <pe...@hotmail.com> wrote:
> Hi Erick,
>
> Thanks for clarifying some of this :)

Re: Flexible search field analyser/tokenizer configuration

Posted by PeterKerk <pe...@hotmail.com>.

Hi Erick,

Thanks for clarifying some of this :)

That triggers a few more questions:

1. I have no "df" setting in my solrconfig.xml file at all, nor do I see a
<requestHandler name="/select" anywhere. How would this typically
look?
2. My site is in 2 languages, Dutch and English. So I now added the Dutch
stopwords like below to my field definition. However, I also want to exclude
English stopwords...does that mean I need to define this field definition
for each language or can I add stopwords for multiple languages in the same
field definition?

    <fieldType name="searchtext" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="20" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="20" side="front"/>
      </analyzer>
    </fieldType>

3. fq:the AND Royal AND Garden works indeed, but how would I go about
making sure that in that query
	a. "the" is ignored
	b. "The Royal Garden" is returned as the 1st result since it's an exact
match and "Royal" as the 2nd result since it's a partial match (on
non-stopwords)? I guess that would be via the ranking you mention, but where
do I configure that for my use case? I have seen weights on results by using
the ^ operator, e.g. &qf=title_search^20.0+province^15+city_search^10.0 but
I doubt that is the way to go here.




Re: Flexible search field analyser/tokenizer configuration

Posted by Erick Erickson <er...@gmail.com>.

There's some confusion here.

First of all, you shouldn't be getting docs like "The Wall" at all,
_assuming_ your fq clause is meant to only include docs with
"the Royal Garden" in the results list. What's happening here is
that the text is being searched for in the default search field, which
will be the "df" setting in your solrconfig.xml file for the /select
request handler.

If that's not germane, then I suspect two things:
1> you don't have your stopwords set up properly in the
<fieldType> definition for the field in question
2> your default operator is OR, in which case try
fq:the AND Royal AND Garden
or set the default operator to AND. (q.op in the /select
request handler in this case).
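
A sketch of the second option, setting the default operator on the /select handler. Only q.op is shown; the rest of the handler definition is assumed to stay as it is.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Require every term to match unless the query itself says otherwise. -->
    <str name="q.op">AND</str>
  </lst>
</requestHandler>
```

The same effect can be had for a single request by appending &q.op=AND to the URL, which is a quick way to test it before touching the config.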

Second, the fact that words like "the" show up in the returned docs is
totally irrelevant to the search process. The text returned
is a verbatim copy of the text sent in. The _indexed_ terms
that are actually searched against may or may not match
it exactly, i.e. the indexed terms may have stopwords
removed, cases folded, stemming performed, etc.

Finally, by using the fq clause combined with the *:* query,
you are completely bypassing ranking. The *:* query is a
"match all docs query", which doesn't bother with scoring.
fq clauses don't contribute to score by definition.

Best,
Erick

On Wed, Oct 1, 2014 at 11:52 AM, PeterKerk <pe...@hotmail.com> wrote:
> Ok, I missed the Query tab where I can do the actual site search :)