
Posted to solr-user@lucene.apache.org by Judioo <co...@judioo.com> on 2011/06/15 10:23:47 UTC

Boost Strangeness

Hi

I'm confused about exactly how boosts affect relevancy scores.

Apologies if I am violating this group's etiquette, but I could not find
Solr's pastebin anywhere.

I have 2 document types but want to return any documents where the requested
ID appears. The ID appears in multiple attributes but I want to boost
results based on which attribute contains the ID.

so my query is

q="id:b007vty6 parent_id:b007vty6 brand_container_id:b007vty6
series_container_id:b007vty6 subseries_container_id:b007vty6
clip_container_id:b007vty6 clip_episode_id:b007vty6"

and I use qf to boost fields

qf="id^10 parent_id^9 brand_container_id^8 series_container_id^8
subseries_container_id^8 clip_container_id^1 clip_episode_id^1"


I expect any document with the following "id:b007vty6" to be returned 1st (
with the highest score ) yet this is not the case. Can anyone explain why
this is? Could it be that


extra info below:

complete URL

/solr/select/?q=id:b007vty6%20parent_id:b007vty6%20brand_container_id:b007vty6%20series_container_id:b007vty6%20subseries_container_id:b007vty6%20clip_container_id:b007vty6%20clip_episode_id:b007vty6&start=0&rows=10&wt=json&indent=on&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1

results

{

   - -
   responseHeader: {
      - status: 0
      - QTime: 12
      - -
      params: {
         - debugQuery: "on"
         - fl:
         "id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score"
         - indent: "on"
         - start: "0"
         - q: "id:b007vty6 parent_id:b007vty6 brand_container_id:b007vty6
         series_container_id:b007vty6 subseries_container_id:b007vty6
         clip_container_id:b007vty6 clip_episode_id:b007vty6"
         - qf: "id^10 parent_id^9 brand_container_id^8 series_container_id^8
         subseries_container_id^8 clip_container_id^1 clip_episode_id^1"
         - wt: "json"
         - rows: "10"
      }
   }
   - -
   response: {
      - numFound: 2
      - start: 0
      - maxScore: 1.5543144
      - -
      docs: [
         - -
         {
            - series_container_id: "b007vm94"
            - id: "b007vsvm"
            - brand_container_id: "b007hhk5"
            - subseries_container_id: "b007vty6"
            - clip_episode_id: ""
            - score: 1.5543144
         }
         - -
         {
            - parent_id: "b007vm94"
            - id: "b007vty6"
            - score: 0.3014368
         }
      ]
   }
   - -
   debug: {
      - rawquerystring: "id:b007vty6 parent_id:b007vty6
      brand_container_id:b007vty6 series_container_id:b007vty6
      subseries_container_id:b007vty6 clip_container_id:b007vty6
      clip_episode_id:b007vty6"
      - querystring: "id:b007vty6 parent_id:b007vty6
      brand_container_id:b007vty6 series_container_id:b007vty6
      subseries_container_id:b007vty6 clip_container_id:b007vty6
      clip_episode_id:b007vty6"
      - parsedquery: "id:b007vty6 PhraseQuery(parent_id:"b 007 vty 6")
      PhraseQuery(brand_container_id:"b 007 vty 6")
      PhraseQuery(series_container_id:"b 007 vty 6")
      PhraseQuery(subseries_container_id:"b 007 vty 6")
      PhraseQuery(clip_container_id:"b 007 vty 6")
      PhraseQuery(clip_episode_id:"b 007 vty 6")"
      - parsedquery_toString: "id:b007vty6 parent_id:"b 007 vty 6"
      brand_container_id:"b 007 vty 6" series_container_id:"b 007 vty 6"
      subseries_container_id:"b 007 vty 6" clip_container_id:"b 007 vty 6"
      clip_episode_id:"b 007 vty 6""
      - -
      explain: {
         - b007vsvm: "
         1.5543144 = (MATCH) product of:
           10.8802 = (MATCH) sum of:
             10.8802 = (MATCH) weight(subseries_container_id:"b 007 vty 6" in 39526), product of:
               0.43911988 = queryWeight(subseries_container_id:"b 007 vty 6"), product of:
                 49.55458 = idf(subseries_container_id: b=547 007=31 vty=1 6=87)
                 0.008861338 = queryNorm
               24.77729 = fieldWeight(subseries_container_id:"b 007 vty 6" in 39526), product of:
                 1.0 = tf(phraseFreq=1.0)
                 49.55458 = idf(subseries_container_id: b=547 007=31 vty=1 6=87)
                 0.5 = fieldNorm(field=subseries_container_id, doc=39526)
           0.14285715 = coord(1/7)
         "
         - b007vty6: "
         0.3014368 = (MATCH) product of:
           2.1100576 = (MATCH) sum of:
             2.1100576 = (MATCH) weight(id:b007vty6 in 39512), product of:
               0.13674039 = queryWeight(id:b007vty6), product of:
                 15.431123 = idf(docFreq=1, maxDocs=3701577)
                 0.008861338 = queryNorm
               15.431123 = (MATCH) fieldWeight(id:b007vty6 in 39512), product of:
                 1.0 = tf(termFreq(id:b007vty6)=1)
                 15.431123 = idf(docFreq=1, maxDocs=3701577)
                 1.0 = fieldNorm(field=id, doc=39512)
           0.14285715 = coord(1/7)
         "
      }
      - QParser: "LuceneQParser"
      - -
      timing: {
         - time: 12
         - -
         prepare: {
            - time: 3
            - -
            org.apache.solr.handler.component.QueryComponent: {
               - time: 3
            }
            - -
            org.apache.solr.handler.component.FacetComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.MoreLikeThisComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.HighlightComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.StatsComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.DebugComponent: {
               - time: 0
            }
         }
         - -
         process: {
            - time: 9
            - -
            org.apache.solr.handler.component.QueryComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.FacetComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.MoreLikeThisComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.HighlightComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.StatsComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.DebugComponent: {
               - time: 9
            }
         }
      }
   }

}
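
A minimal sketch (Python, not part of the original post) that re-multiplies the factors listed in the explain section of the output above. The numbers are copied from that output; the variable names are illustrative only. No boost factor appears in either product, which is consistent with the request having been handled by LuceneQParser (as the debug section reports) rather than by dismax.

```python
# Factors copied from the explain output for document b007vsvm
# (match on subseries_container_id:"b 007 vty 6").
idf_phrase = 49.55458        # idf reported for the phrase terms b, 007, vty, 6
query_norm = 0.008861338
field_norm_phrase = 0.5
coord = 1 / 7                # 1 of the 7 query clauses matched

query_weight = idf_phrase * query_norm
field_weight = 1.0 * idf_phrase * field_norm_phrase   # tf * idf * fieldNorm
score_vsvm = query_weight * field_weight * coord

# Factors copied from the explain output for document b007vty6
# (match on id:b007vty6).
idf_id = 15.431123
field_norm_id = 1.0
score_vty6 = (idf_id * query_norm) * (1.0 * idf_id * field_norm_id) * coord

print(score_vsvm, score_vty6)
```

Both results should land very close to the scores reported in the response (1.5543144 and 0.3014368). The gap between them comes almost entirely from the idf values, not from any field boost.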

Re: Boost Strangeness

Posted by Judioo <co...@judioo.com>.

The string type also does not seem to accept spaces. Currently the _id fields can
contain multiple IDs (I'm using this as a multiType alternative). This is why I
used the text type.


Boost Strangeness

Posted by Judioo <co...@judioo.com>.

WONDERFUL!
Just reporting back.
This document is ACE

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

For explaining what the filters are and how to affect the analyzer.

Erick, your statement "First, boosting isn't absolute" played on my mind, so
I continued to investigate boosting.

I found this document that ( at last ) explains the dismax logic

http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

The reason I was not getting the order I required was that:
a) my boost values were too close together.
b) similar IDs in a document affected the score.


It seems that if a partial match is made, the product (a % of the
total boost) contributes to the document's score.
This meant that one type of document in the index had a higher
aggregate score because it had all but one of the boosted
fields (it does not have parent_id) and those fields were
populated with content that was *very* similar to the requested id.

for example

required id = b011mg62
X_id = b011mgsf

Due to the partial matching and the closeness of the boost values, this
type of document always acquired a higher score than another document
with just one matching field (i.e. the id field).

My solution was to increase the value of the fields I wanted to *really* count

id^100000 parent_id^5000 brand_container_id^500 ....

As a result even if there are similar matches in any field the id and
parent_id matches should always receive a higher boost.
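
As a rough sketch (Python, not from the original message), a dismax request with these widely spaced boosts could be assembled as below. Only the three boosts quoted above are used; the remaining fields were elided in the message and are not guessed here. The helper name `dismax_url` and the reduced `fl` list are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Boosts quoted in the message above; the elided fields are left out.
boosts = {"id": 100000, "parent_id": 5000, "brand_container_id": 500}


def dismax_url(term, field_boosts, base="/solr/select"):
    """Build a dismax select URL (hypothetical helper, for illustration)."""
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    params = {
        "q": term,
        "defType": "dismax",
        "qf": qf,
        "debugQuery": "on",
        "fl": "id,score",
        "wt": "json",
    }
    # urlencode percent-encodes "^" and turns spaces into "+".
    return base + "?" + urlencode(params)


url = dismax_url("b011mg62", boosts)
print(url)
```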


This was also useful
http://stackoverflow.com/questions/2179497/adding-date-boosting-to-complex-solr-queries


Thanks for the help!

Re: Boost Strangeness

Posted by Erick Erickson <er...@gmail.com>.

Right, if you've only changed WordDelimiterFilterFactory in the query analyzer,
then the tokens you're analyzing may be split up. Try running some of the
terms through the admin/analysis page. Unless you have
catenateAll="1" in the definition, the whole term won't be there.

It becomes a question of why you even want WDFF in there in the first
place, do you ever want to split these fields up this way? Maybe start
by just taking it out completely?
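
To make the splitting concrete, here is a simplified sketch in Python. It only imitates the letter/digit transition rule and is not the real WordDelimiterFilterFactory, which has more rules and options; the function name `split_like_wdff` is hypothetical. For the ID from the first post it yields the same pieces that show up in that post's parsedquery ("b 007 vty 6").

```python
import re


def split_like_wdff(value):
    """Approximate splitting on letter/digit transitions only.

    This is a simplification for illustration, not Solr's actual filter.
    """
    return re.findall(r"[A-Za-z]+|[0-9]+", value)


tokens = split_like_wdff("b007vty6")
print(tokens)

# Two different IDs from this thread can end up sharing sub-tokens.
shared = set(split_like_wdff("b011mg62")) & set(split_like_wdff("b011mgsf"))
print(sorted(shared))
```

With a non-tokenized string field, each ID would stay a single term, so two different IDs would share nothing.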

Best
Erick


Re: Boost Strangeness

Posted by Judioo <co...@judioo.com>.

fascinating!!!!

Thank you so much Erick, I'm slowly beginning to understand.

So I've discovered that by defining 'splitOnNumerics="0"' on the filter
class 'solr.WordDelimiterFilterFactory' (for ONLY the query analyzer) I
can get *closer* to my required goal!

Now something else odd is occurring.

It only returns 2 results where there are over 70?

Why is that? I can't find where this is explained :(

query

/solr/select?omitNorms=true&q=b006m86d&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on&omitNorms=true

output

{

   - -
   responseHeader: {
      - status: 0
      - QTime: 51
      - -
      params: {
         - debugQuery: "on"
         - fl:
         "type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score"
         - indent: "on"
         - q: "b006m86d"
         - qf: "id^10 parent_id^9 brand_container_id^8 series_container_id^8
         subseries_container_id^8 clip_container_id^1 clip_episode_id^1"
         - wt: "json"
         - -
         omitNorms: [
            - "true"
            - "true"
         ]
         - defType: "dismax"
      }
   }
   - -
   response: {
      - numFound: 2
      - start: 0
      - maxScore: 13.473297
      - -
      docs: [
         - -
         {
            - parent_id: ""
            - id: "b006m86d"
            - type: "brand"
            - score: 13.473297
         }
         - -
         {
            - series_container_id: ""
            - id: "b00y1w9h"
            - type: "episode"
            - brand_container_id: "b006m86d"
            - subseries_container_id: ""
            - clip_episode_id: ""
            - score: 11.437143
         }
      ]
   }
   - -
   debug: {
      - rawquerystring: "b006m86d"
      - querystring: "b006m86d"
      - parsedquery: "+DisjunctionMaxQuery((id:b006m86d^10.0 |
      clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 |
      series_container_id:b006m86d^8.0 | clip_container_id:b006m86d |
      brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0)) ()"
      - parsedquery_toString: "+(id:b006m86d^10.0 | clip_episode_id:b006m86d
      | subseries_container_id:b006m86d^8.0 |
      series_container_id:b006m86d^8.0 |
      clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 |
      parent_id:b006m86d^9.0) ()"
      - -
      explain: {
         - b006m86d: "
         13.473297 = (MATCH) sum of:
           13.473297 = (MATCH) max of:
             13.473297 = (MATCH) fieldWeight(id:b006m86d in 27636), product of:
               1.0 = tf(termFreq(id:b006m86d)=1)
               13.473297 = idf(docFreq=2, maxDocs=783800)
               1.0 = fieldNorm(field=id, doc=27636)
         "
         - b00y1w9h: "
         11.437143 = (MATCH) sum of:
           11.437143 = (MATCH) max of:
             11.437143 = (MATCH) weight(brand_container_id:b006m86d^8.0 in 61), product of:
               0.82407516 = queryWeight(brand_container_id:b006m86d^8.0), product of:
                 8.0 = boost
                 13.878762 = idf(docFreq=1, maxDocs=783800)
                 0.007422088 = queryNorm
               13.878762 = (MATCH) fieldWeight(brand_container_id:b006m86d in 61), product of:
                 1.0 = tf(termFreq(brand_container_id:b006m86d)=1)
                 13.878762 = idf(docFreq=1, maxDocs=783800)
                 1.0 = fieldNorm(field=brand_container_id, doc=61)
         "
      }
      - QParser: "DisMaxQParser"
      - altquerystring: null
      - boostfuncs: null
      - -
      timing: {
         - time: 51
         - -
         prepare: {
            - time: 6
            - -
            org.apache.solr.handler.component.QueryComponent: {
               - time: 5
            }
            - -
            org.apache.solr.handler.component.FacetComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.MoreLikeThisComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.HighlightComponent: {
               - time: 1
            }
            - -
            org.apache.solr.handler.component.StatsComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.DebugComponent: {
               - time: 0
            }
         }
         - -
         process: {
            - time: 45
            - -
            org.apache.solr.handler.component.QueryComponent: {
               - time: 27
            }
            - -
            org.apache.solr.handler.component.FacetComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.MoreLikeThisComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.HighlightComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.StatsComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.DebugComponent: {
               - time: 18
            }
         }
      }
   }

}



Re: Boost Strangeness

Posted by Erick Erickson <er...@gmail.com>.

First off, you didn't "violate group etiquette". In fact, yours was
one of the better first posts in terms of providing enough information
for us to actually help!

A very useful page is the admin/analysis page, which shows how the
analysis chain works. For instance, if you haven't changed the
field type (i.e. <fieldType name="text">), your input is
being broken up by WordDelimiterFilterFactory. Be sure to check
the "verbose" checkbox and enter text in both the query and
index boxes!

Here's an invaluable page, though do note that it's not exhaustive:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

But on to your problem:

First, boosting isn't absolute. Boosting terms just tends to
bubble things up; you have to experiment with various weights.

To get the full comparison for both documents you're curious about,
try using "explainOther". see:
http://wiki.apache.org/solr/SolrRelevancyFAQ#Why_doesn.27t_document_id:juggernaut_appear_in_the_top_10_results_for_my_query

If you use that against the two docs in question, you should
see (although it's a hard read!) the reason the docs got
their relative scores.
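
A small sketch of such a request, written in Python purely for illustration: it reuses the dismax parameters already posted in this thread and adds explainOther with a query that selects the document of interest. The variable names are assumptions, and the exact shape of the returned debug section may differ between Solr versions.

```python
from urllib.parse import urlencode

params = {
    "q": "b007vty6",
    "defType": "dismax",
    "qf": ("id^10 parent_id^9 brand_container_id^8 series_container_id^8 "
           "subseries_container_id^8 clip_container_id^1 clip_episode_id^1"),
    "debugQuery": "on",              # explainOther is reported in the debug output
    "explainOther": "id:b007vty6",   # the document whose score should be explained
    "wt": "json",
    "indent": "on",
}

url = "/solr/select?" + urlencode(params)
print(url)
```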

Finally, your next e-mail hints at what's happening. If you're
putting multiple tokens in some of these fields, the length
normalization may be causing the matches to score lower. You can
try disabling those calculations (omitNorms="true" in your field definition).
See:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
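
As a sketch of what length normalization does, assuming Lucene's classic default similarity and no index-time boost, the norm is 1/sqrt(number of terms in the field). The function name below is illustrative. This matches the fieldNorm values in the first post's explain output: 0.5 for the field tokenized into four terms (b, 007, vty, 6) and 1.0 for the single-term id field.

```python
import math


def length_norm(num_terms):
    """Classic Lucene length norm, before it is stored lossily in the index.

    Assumes the default similarity and a field boost of 1.
    """
    return 1.0 / math.sqrt(num_terms)


print(length_norm(4))  # field split into four tokens
print(length_norm(1))  # single-token field, e.g. an untokenized ID
```

With omitNorms="true" this factor is not stored, so the number of tokens in a field no longer lowers its score.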

String types accept spaces just fine, but you might want to define
the fields with 'multiValued="true" ' and index each as a separate
field (note that won't work with a field that's also your <uniqueKey>).

Best
Erick


Re: Boost Strangeness

Posted by Judioo <co...@judioo.com>.

   <dynamicField name="*_id"  type="text"    indexed="true"  stored="true"/>

so all attributes except 'id' are of type text.

I didn't know that about the string type. So is my problem as described
(that partial matches are contributing to the calculation), and does defining
the field type as string solve this problem?

Or is my understanding completely incorrect?

Thanks in advance


Re: Boost Strangeness

Posted by Ahmet Arslan <io...@yahoo.com>.

> /solr/select/?q=b007vty6&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on
> 
> 
> same result ( just higher scores ). It's almost as if 
> partial matches on
> brand|series_container_id and id are being considered in
> the 1st document.
> Surely this can't be right / expected?

What is your fieldType definition? Don't you think it is better to use string type which is not tokenized?
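
To illustrate the point about tokenization: the parsedquery in the debug
output further down this thread shows the text fields being queried as the
phrase "b 007 vty 6", while the id field is queried as the single term
b007vty6. The sketch below only approximates that visible effect by splitting
on letter/digit boundaries. It is not Solr's analyzer, and the function name
is made up for illustration.

```python
import re


def split_on_letter_digit_boundaries(value):
    """Split a value into runs of letters and runs of digits.

    Hypothetical helper: it mimics the splitting seen in the parsedquery
    output, not the actual analysis chain configured in schema.xml.
    """
    return re.findall(r"[A-Za-z]+|[0-9]+", value)


# A tokenized (text) field would see several small tokens.
text_tokens = split_on_letter_digit_boundaries("b007vty6")

# A non-tokenized (string) field keeps the whole ID as one term.
string_tokens = ["b007vty6"]

print(text_tokens)    # ['b', '007', 'vty', '6']
print(string_tokens)  # ['b007vty6']
```

With the string type each ID is a single term, so the idf used in scoring is
that of one term rather than a sum over four small tokens.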

Re: Boost Strangeness

Posted by Judioo <co...@judioo.com>.

Apologies, I have tried that method as well.

/solr/select/?q=b007vty6&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on


Same result (just higher scores). It's almost as if partial matches on
brand|series_container_id and id are being considered in the 1st document.
Surely this can't be right / expected?

{

   - -
   responseHeader: {
      - status: 0
      - QTime: 13
      - -
      params: {
         - debugQuery: "on"
         - fl:
         "id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score"
         - indent: "on"
         - q: "b007vty6"
         - qf: "id^10 parent_id^9 brand_container_id^8 series_container_id^8
         subseries_container_id^8 clip_container_id^1 clip_episode_id^1"
         - wt: "json"
         - defType: "dismax"
      }
   }
   - -
   response: {
      - numFound: 2
      - start: 0
      - maxScore: 21.138214
      - -
      docs: [
         - -
         {
            - series_container_id: "b007vm94"
            - id: "b007vsvm"
            - brand_container_id: "b007hhk5"
            - subseries_container_id: "b007vty6"
            - clip_episode_id: ""
            - score: 21.138214
         }
         - -
         {
            - parent_id: "b007vm94"
            - id: "b007vty6"
            - score: 5.1243143
         }
      ]
   }
   - -
   debug: {
      - rawquerystring: "b007vty6"
      - querystring: "b007vty6"
      - parsedquery: "+DisjunctionMaxQuery((id:b007vty6^10.0 |
      clip_episode_id:"b 007 vty 6" |
      subseries_container_id:"b 007 vty 6"^8.0 |
      series_container_id:"b 007 vty 6"^8.0 |
      clip_container_id:"b 007 vty 6" |
      brand_container_id:"b 007 vty 6"^8.0 |
      parent_id:"b 007 vty 6"^9.0)) ()"
      - parsedquery_toString: "+(id:b007vty6^10.0 |
      clip_episode_id:"b 007 vty 6" |
      subseries_container_id:"b 007 vty 6"^8.0 |
      series_container_id:"b 007 vty 6"^8.0 |
      clip_container_id:"b 007 vty 6" |
      brand_container_id:"b 007 vty 6"^8.0 |
      parent_id:"b 007 vty 6"^9.0) ()"
      - -
      explain: {
         - b007vsvm: "
         21.138214 = (MATCH) sum of:
           21.138214 = (MATCH) max of:
             21.138214 = (MATCH) weight(subseries_container_id:"b 007 vty 6"^8.0 in 39526), product of:
               0.85312855 = queryWeight(subseries_container_id:"b 007 vty 6"^8.0), product of:
                 8.0 = boost
                 49.55458 = idf(subseries_container_id: b=547 007=31 vty=1 6=87)
                 0.0021519922 = queryNorm
               24.77729 = fieldWeight(subseries_container_id:"b 007 vty 6" in 39526), product of:
                 1.0 = tf(phraseFreq=1.0)
                 49.55458 = idf(subseries_container_id: b=547 007=31 vty=1 6=87)
                 0.5 = fieldNorm(field=subseries_container_id, doc=39526)
         "
         - b007vty6: "
         5.1243143 = (MATCH) sum of:
           5.1243143 = (MATCH) max of:
             5.1243143 = (MATCH) weight(id:b007vty6^10.0 in 39512), product of:
               0.33207658 = queryWeight(id:b007vty6^10.0), product of:
                 10.0 = boost
                 15.431123 = idf(docFreq=1, maxDocs=3701577)
                 0.0021519922 = queryNorm
               15.431123 = (MATCH) fieldWeight(id:b007vty6 in 39512), product of:
                 1.0 = tf(termFreq(id:b007vty6)=1)
                 15.431123 = idf(docFreq=1, maxDocs=3701577)
                 1.0 = fieldNorm(field=id, doc=39512)
         "
      }
      - QParser: "DisMaxQParser"
      - altquerystring: null
      - boostfuncs: null
      - -
      timing: {
         - time: 13
         - -
         prepare: {
            - time: 3
            - -
            org.apache.solr.handler.component.QueryComponent: {
               - time: 3
            }
            - -
            org.apache.solr.handler.component.FacetComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.MoreLikeThisComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.HighlightComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.StatsComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.DebugComponent: {
               - time: 0
            }
         }
         - -
         process: {
            - time: 10
            - -
            org.apache.solr.handler.component.QueryComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.FacetComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.MoreLikeThisComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.HighlightComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.StatsComponent: {
               - time: 0
            }
            - -
            org.apache.solr.handler.component.DebugComponent: {
               - time: 10
            }
         }
      }
   }

}
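
The two scores in the explain section above can be reproduced by hand, which
helps show where the ordering comes from. The sketch below simply multiplies
the factors listed under each "product of" line. The function and variable
names are made up for illustration, and the results agree with the reported
scores only up to rounding.

```python
def explain_score(boost, idf, query_norm, tf, field_norm):
    """Recombine the factors listed in the explain output.

    queryWeight = boost * idf * queryNorm
    fieldWeight = tf * idf * fieldNorm
    score       = queryWeight * fieldWeight
    """
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


QUERY_NORM = 0.0021519922  # same value in both explanations

# Document b007vsvm: phrase match on subseries_container_id (boost 8).
score_subseries = explain_score(
    boost=8.0, idf=49.55458, query_norm=QUERY_NORM, tf=1.0, field_norm=0.5
)

# Document b007vty6: term match on id (boost 10).
score_id = explain_score(
    boost=10.0, idf=15.431123, query_norm=QUERY_NORM, tf=1.0, field_norm=1.0
)

print(round(score_subseries, 3))
print(round(score_id, 3))
```

Note that idf enters the product twice. The phrase idf for the tokenized field
(49.55458) is roughly three times the single-term idf for id (15.431123), and
that difference outweighs both the higher boost on id and the 0.5 fieldNorm on
subseries_container_id.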



On 15 June 2011 09:39, Ahmet Arslan <io...@yahoo.com> wrote:

> > I have 2 document types but want to return any documents
> > where the requested
> > ID appears. The ID appears in multiple attributes but I
> > want to boost
> > results based on which attribute contains the ID.
> >
> > so my query is
> >
> > q="id:b007vty6 parent_id:b007vty6
> > brand_container_id:b007vty6
> > series_container_id:b007vty6
> > subseries_container_id:b007vty6
> > clip_container_id:b007vty6 clip_episode_id:b007vty6"
> >
> > and I use qf to boost fields
> >
> > qf="id^10 parent_id^9 brand_container_id^8
> > series_container_id^8
> > subseries_container_id^8 clip_container_id^1
> > clip_episode_id^1"
> >
>
> There is a misunderstanding here. qf parameter is specific to (e)dismax
> query parser plugin. For more information about it please see:
>
> http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/
>
> Your query string can be something like this:
>
> defType=dismax&q=b007vty6&qf="id^10 parent_id^9 brand_container_id^8 ...
>
> It automatically expands your simple word query to multiple fields.
> defType=dismax is a must to enable it, either in URL or in solrconfig.xml
> (defaults section).
>

Re: Boost Strangeness

Posted by Ahmet Arslan <io...@yahoo.com>.

> I have 2 document types but want to return any documents
> where the requested
> ID appears. The ID appears in multiple attributes but I
> want to boost
> results based on which attribute contains the ID.
> 
> so my query is
> 
> q="id:b007vty6 parent_id:b007vty6
> brand_container_id:b007vty6
> series_container_id:b007vty6
> subseries_container_id:b007vty6
> clip_container_id:b007vty6 clip_episode_id:b007vty6"
> 
> and I use qf to boost fields
> 
> qf="id^10 parent_id^9 brand_container_id^8
> series_container_id^8
> subseries_container_id^8 clip_container_id^1
> clip_episode_id^1"
> 

There is a misunderstanding here. The qf parameter is specific to the (e)dismax query parser plugin. For more information about it, please see:

http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

Your query string can be something like this:

defType=dismax&q=b007vty6&qf="id^10 parent_id^9 brand_container_id^8 ...

It automatically expands your simple word query to multiple fields.
defType=dismax is a must to enable it, either in the URL or in solrconfig.xml (defaults section).
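
As a rough sketch of the suggestion above, the request could be assembled
like this. The parameter names (q, defType, qf, debugQuery, wt) and the
/solr/select/ path are the ones used in this thread; the helper name and the
QF_BOOSTS dictionary are made up for illustration, and the host part of the
URL is left out because the thread does not give one.

```python
from urllib.parse import urlencode

# Field boosts copied from the qf list in the original question.
QF_BOOSTS = {
    "id": 10,
    "parent_id": 9,
    "brand_container_id": 8,
    "series_container_id": 8,
    "subseries_container_id": 8,
    "clip_container_id": 1,
    "clip_episode_id": 1,
}


def build_dismax_query(term, boosts=QF_BOOSTS, base="/solr/select/"):
    """Build a dismax request path with a plain q and boosts in qf."""
    # qf is a space-separated list of field^boost entries.
    qf = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    params = {
        "q": term,            # just the ID, no field prefixes
        "defType": "dismax",  # required, otherwise qf is ignored
        "qf": qf,
        "debugQuery": "on",
        "wt": "json",
    }
    # urlencode percent-encodes '^' and turns spaces into '+'.
    return base + "?" + urlencode(params)


url = build_dismax_query("b007vty6")
print(url)
```

The q parameter carries only the bare ID, and the field list with its boosts
lives entirely in qf.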