Posted to solr-user@lucene.apache.org by John Blythe <jo...@curvolabs.com> on 2015/05/18 19:57:07 UTC

Relevancy Scoring

Background:
I'm using Solr as a search mechanism for users but, before even getting to
that point, as a means of intelligent inference, more or less. Product
data comes in and we're hoping to match it to the correct known product
without having to use the user for confirmation/search.

Problem:
I get a maxScore (with the correct result at the top) of 618.22626 using
the manufacturer's name, the product number, and the product description.
All of these items are coming from a previous purchaser so we have to
account for manufacturer name variations, miskeying of product numbers, and
variances of descriptions. The maxScore is 772 when I remove the
description.

My initial question is regarding relevancy scoring (
https://wiki.apache.org/solr/SolrRelevancyFAQ). I get that many of the
description's tokens will be found throughout the other documents, thus
keeping the relevancy at bay per the IDF portion of the relevancy score. I
suppose the actual question, then, is if a low relevancy score on one field
hurts the rest of them / the cumulative score, or if it simply keeps that
field's contribution lower than it'd otherwise be. I thought it was the
latter, but the results I mention above are making me think that the first
scenario is actually the case.

Based on what I hear about the above, a follow up question may be what in
the world is wrong with my analyzer :)

Thanks for any thoughts!

Best,
John
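
On the scoring question above: a minimal sketch, assuming Lucene's classic
TF-IDF similarity with boosts and field norms fixed at 1 and every matched
term occurring once (so tf = 1). The clause names and IDF values are made up
for illustration. It shows how adding clauses that only partly match can lower
the whole score through coord and queryNorm, not just through that field's own
contribution.

```python
import math

def query_norm(idfs):
    # queryNorm = 1 / sqrt(sum of squared term weights); weight == idf when boosts are 1
    return 1.0 / math.sqrt(sum(w * w for w in idfs))

def classic_score(query_idfs, matched):
    """query_idfs: {term: idf} for every query clause (hypothetical values).
    matched: the set of terms found in the document, each with frequency 1."""
    coord = len(matched) / len(query_idfs)   # fraction of clauses that matched
    norm = query_norm(query_idfs.values())   # shrinks as clauses are added
    return coord * norm * sum(query_idfs[t] ** 2 for t in matched)

# Two rare, fully matching clauses (manufacturer, product number).
without_desc = {"mfg": 8.0, "partno": 8.0}
# Same query plus four common description terms, only two of which match.
with_desc = dict(without_desc, ice=2.0, cream=2.0, straw=2.0, pint=2.0)

score_without = classic_score(without_desc, {"mfg", "partno"})
score_with = classic_score(with_desc, {"mfg", "partno", "ice", "cream"})
print(round(score_without, 2), round(score_with, 2))  # 11.31 7.56
```

The flattening of all terms into one clause list is a simplification; real
queries nest clauses per field, and other similarity implementations behave
differently.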

Re: Relevancy Scoring

Posted by John Blythe <jo...@curvolabs.com>.

Awesome, following it now!

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | john@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, May 18, 2015 at 8:21 PM, Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> Glad you figured things out and found splainer useful! Pull requests, bugs,
> feature requests welcome!
>
> https://github.com/o19s/splainer
>
> Doug

Re: Relevancy Scoring

Posted by Doug Turnbull <dt...@opensourceconnections.com>.

Glad you figured things out and found splainer useful! Pull requests, bugs,
feature requests welcome!

https://github.com/o19s/splainer

Doug

On Monday, May 18, 2015, John Blythe <jo...@curvolabs.com> wrote:

> Doug,
>
> very very cool tool you've made there. thanks so much for sharing!
>
> i ended up removing the shinglefilterfactory and voila! things are back in
> good, working order with some great matching. i'm not 100% certain as to
> why shingling was so ineffective. i'm guessing the stacked terms created
> lower relevancy due to IDF on the *joint* terms/token?
>
> >> On Mon, May 18, 2015 at 3:02 PM, John Blythe <john@curvolabs.com> wrote:
> >>
> >> > Thanks again for the speediness, Doug.
> >> >
> >> > Good to know on some of those things, not least of all the + indicating
> >> > a mandatory field and the parentheses. It seems like the escaping is
> >> > pretty robust in light of the product number.
> >> >
> >> > I'm thinking it has to be largely related to the analyzer. Check this
> >> > out, this time with more of a real world case for us. Searching for
> >> > "descript2: CANN SCREW PT 3.5X50MM" produces a top result that has
> >> > "Cannulated screw PT 4.0x40mm" as its description. There is a document,
> >> > though, that has the description of "Cannulated screw PT 3.5x50mm", the
> >> > exact same rendering (minus lowercasing) that the analyzer is producing
> >> > (per the /analysis page). Why would 4.0x40 come up first? The top four
> >> > results have 4.0x[Something]. It's not till the fifth result that you
> >> > see a 3.5 something: "Cannulated screw PT 3.5x105mm", at which point I'm
> >> > saying WTF. So close, but then it ignores the "50" for a "105" instead.
> >> >
> >> > Further, adding parentheses around the phrase, "descript2: (CANN SCREW
> >> > PT 3.5X50MM)", produces top results that have the correct dimensions
> >> > (3.5x50) but the wrong type. Instead of "cannulated" screws we see
> >> > "cortical." I'm convinced Solr is trolling me at this point :p
> >> >
> >> > On Mon, May 18, 2015 at 2:34 PM, Doug Turnbull <
> >> > dturnbull@opensourceconnections.com> wrote:
> >> >
> >> > > You might just need some syntax help. Not sure what the Solr admin
> >> > > escapes, but much of the text in your query actually has reserved
> >> > > meaning. Also, when a term appears without a fieldName: prefix
> >> > > directly in front of it, I believe it's going to search the default
> >> > > field (it's no longer attached to the field). You need to use parens
> >> > > to attach multiple terms to that field for search.
> >> > >
> >> > > I'd try to see if doing any of the following helps:
> >> > >
> >> > > Add parens to group terms to the field:
> >> > >
> >> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +
> >> > > productnumber:(001-029-1298)
> >> > >
> >> > > Also keep in mind "+" means mandatory, and it's an operator on just one
> >> > > field. So in the above you're requiring description and product number
> >> > > match the provided terms.
> >> > >
> >> > > Further, you may need to escape the "-" as that means "NOT". You can do
> >> > > that with the following:
> >> > >
> >> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +
> >> > > productnumber:(001\-029\-1298)
> >> > >
> >> > > You can read more in the article on Solr query syntax:
> >> > > https://wiki.apache.org/solr/SolrQuerySyntax
> >> > >
> >> > > Hope that helps, for all I know your cut and paste didn't work and I'm
> >> > > assuming you have syntax issues :)
> >> > >
> >> > > -Doug
> >> > >
> >> > > On Mon, May 18, 2015 at 2:25 PM, John Blythe <john@curvolabs.com> wrote:
> >> > >
> >> > > > Hey Doug,
> >> > > >
> >> > > > Thanks for the quick reply.
> >> > > >
> >> > > > No edismax just yet. Planning on getting there, but have been trying
> >> > > > to fine tune the 3 primary fields we use over the last week or so
> >> > > > before jumping into edismax and its nifty toolset to help push our
> >> > > > accuracy and precision even further (aside: is this a good strategy?)
> >> > > >
> >> > > > For now I'm querying directly in the admin interface, doing something
> >> > > > like this:
> >> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt + productnumber: 001-029-1298
> >> > > >
> >> > > > versus
> >> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt
> >> > > >
> >> > > > Another interesting and likely related factor is the description's
> >> > > > lack of help. With the product number in place it gets nailed even
> >> > > > with stray zeros, 4's instead of 1's, etc.
> >> > > >
> >> > > > Without it, though, the querying just flat out sucks. For instance, I
> >> > > > just saw something akin to this:
> >> > > > mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt
> >> > > >
> >> > > > that got nowhere near what it should have. Straw would have a synonym
> >> > > > to map to strawberry and would match the document's description
> >> > > > *exactly*, yet Solr would push out all sorts of peripheral suggestions
> >> > > > that didn't match strawberry or were a different amount (.75pt, for
> >> > > > instance). I know I'm no expert, but I was thinking my analyzer was a
> >> > > > bit better than that :p
> >> > > >
> >> > > > On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull <
> >> > > > dturnbull@opensourceconnections.com> wrote:
> >> > > >
> >> > > > > > The maxScore is 772 when I remove the description.
> >> > > > > > I suppose the actual question, then, is if a low relevancy score
> >> > > > > > on one field hurts the rest of them / the cumulative score,
> >> > > > >
> >> > > > > This depends a lot on how you're searching over these fields. Is
> >> > > > > this a (e)dismax query? Or a lucene query? Something else?
> >> > > > >
> >> > > > > Across fields there's query normalization, which attempts to take a
> >> > > > > sum of squares of IDFs of the search terms across the fields being
> >> > > > > searched. Adding/removing a field could impact query normalization.
> >> > > > >
> >> > > > > By removing a field, you also likely remove a boolean clause. By
> >> > > > > removing the clause, there's less of a chance the coordination
> >> > > > > factor (known as coord) would punish your relevancy score.
> >> > > > >
> >> > > > > Otherwise, don't know -- perhaps you could give us more information
> >> > > > > on how you're searching your documents? Perhaps a sample Solr URL
> >> > > > > that shows how you're querying?
> >> > > > >
> >> > > > > Cheers,


-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Relevant Search <http://manning.com/turnbull> from Manning
Publications
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.
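
The query-syntax advice quoted in this message (group a field's terms in
parentheses, escape reserved characters such as "-") can be sketched as
follows. This is an illustration under assumptions, not the method used in the
thread: the helper names, the localhost URL, and the "products" core are
hypothetical. The escaped character set follows the reserved characters of the
standard Lucene/Solr query parser, and q, fl, wt and debugQuery are standard
Solr request parameters.

```python
import re
from urllib.parse import urlencode

# Characters with reserved meaning in the standard Lucene/Solr query parser.
_RESERVED = re.compile(r'([\\+\-!():^\[\]"{}~*?|&/])')

def escape_term(text):
    """Backslash-escape reserved query characters so they are treated as text."""
    return _RESERVED.sub(r"\\\1", text)

def field_clause(field, text, required=False):
    """Attach every term to the field with parentheses; '+' marks it mandatory."""
    prefix = "+" if required else ""
    return f"{prefix}{field}:({escape_term(text)})"

query = " ".join([
    field_clause("mfgname2", "Ben & Jerry's"),
    field_clause("descript1", "Strawberry Shortcake Ice Cream 1.5pt", required=True),
    field_clause("productnumber", "001-029-1298", required=True),
])

# Hypothetical host and core; debugQuery=true asks Solr for the score explanation.
base_url = "http://localhost:8983/solr/products/select"
url = base_url + "?" + urlencode(
    {"q": query, "fl": "*,score", "wt": "json", "debugQuery": "true"}
)
print(query)
print(url)
```

Escaping happens before URL encoding: the backslashes belong to the query
syntax, while urlencode only makes the finished string safe to put in a URL
that can be opened directly or pasted into a tool such as Splainer.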

Re: Relevancy Scoring

Posted by John Blythe <jo...@curvolabs.com>.

Doug,

very very cool tool you've made there. thanks so much for sharing!

i ended up removing the shinglefilterfactory and voila! things are back in
good, working order with some great matching. i'm not 100% certain as to
why shingling was so ineffective. i'm guessing the stacked terms created
lower relevancy due to IDF on the *joint* terms/token?

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | john@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713
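
On the shingling guess above: a rough sketch of what a shingle filter emits,
assuming the ShingleFilterFactory defaults (two-word shingles, unigrams kept,
space as separator). The token lists are hypothetical, since the actual
analyzer chain is not shown in this thread. It only illustrates that shingling
adds extra terms, each scored with its own IDF, and that a near-miss
description shares most of them; it does not establish why the scores dropped.

```python
def shingle(tokens, min_size=2, max_size=2, output_unigrams=True, sep=" "):
    """Emit word n-grams (shingles) at each position, optionally with unigrams."""
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append(tok)
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out

# Hypothetical analyzed tokens for the query and for a wrong-size document.
query_terms = shingle(["cannulated", "screw", "pt", "3.5x50mm"])
near_miss_terms = shingle(["cannulated", "screw", "pt", "4.0x40mm"])

shared = set(query_terms) & set(near_miss_terms)
print(query_terms)
print(len(shared), "of", len(query_terms), "query terms also occur in the near miss")
```

With these inputs the query grows from 4 terms to 7, and 5 of the 7 also occur
in the wrong-size description.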

On Mon, May 18, 2015 at 4:57 PM, John Blythe <jo...@curvolabs.com> wrote:

> Doug,
>
> A couple things quickly:
> - I'll check in to that. How would you go about testing things, direct
> URL? If so, how would you compose one of the examples above?
> - yup, I used it extensively before testing scores to ensure that I was
> getting things parsed appropriately (segmenting off the unit of measure
> [mm] whilst still maintaining the decimal instead of breaking it up was my
> largest concern as of late)
> - to that point, though, it looks like one of my blunders was in the
> synonyms file. i just referenced /analysis/ again and realized "CANN" was
> being transposed to "cannula" instead of "cannulated" #facepalm
> - i'll be GLAD to use that! i'd been trying to use http://explain.solr.pl/
> previously but it kept error'ing out on me :\
>
> thanks again, will report back!
>
> On Mon, May 18, 2015 at 4:47 PM, Doug Turnbull <
> dturnbull@opensourceconnections.com> wrote:
>
>> Hey John,
>>
>> I think you likely do need to think about escaping the query operators. I
>> doubt the Solr admin could tell the difference.
>>
>> For analysis, have you looked at the handy analysis tool in the Solr Admin
>> UI? Its pretty indespensible for figuring out if an analyzed query matches
>> an analyzed field.
>>
>> Outside of that, I can selfishly plug Splainer (http://splainer.io) that
>> gives you more insight into the Solr relevance explain. You would paste in
>> something like
>> http://solr.quepid.com/solr/statedecoded/select?q=text:(deer%20hunting).
>>
>> Cheers!
>> -Doug
>>
>> On Mon, May 18, 2015 at 3:02 PM, John Blythe <jo...@curvolabs.com> wrote:
>>
>> > Thanks again for the speediness, Doug.
>> >
>> > Good to know on some of those things, not least of all the + indicating
>> a
>> > mandatory field and the parentheses. It seems like the escaping is
>> pretty
>> > robust in light of the product number.
>> >
>> > I'm thinking it has to be largely related to the analyzer. Check this
>> out,
>> > this time with more of a real world case for us. Searching for
>> "descript2:
>> > CANN SCREW PT 3.5X50MM" produces a top result that has "Cannulated
>> screw PT
>> > 4.0x40mm" as its description. There is a document, though, that has the
>> > description of "Cannulated screw PT 3.5x50mm"—the exact same thing
>> (minus
>> > lowercases) rendering that the analyzer is producing (per the /analysis
>> > page). Why would 4.0x40 come up first?  The top four results have
>> > 4.0x[Something]. It's not till the fifth result that you see a 3.5
>> > something: "Cannulated screw PT 3.5x105mm" at which point I'm saying
>> WTF.
>> > So close, but then it ignores the "50" for a "105" instead.
>> >
>> > Further, adding parenthesis around the phrase—"descript2: (CANN SCREW PT
>> > 3.5X50MM)"—produces top results that have the correct
>> dimensions—3.5x50—but
>> > the wrong type. Instead of "cannulated" screws we see "cortical." I'm
>> > convinced Solr is trolling me at this point :p
>> >
>> > --
>> > *John Blythe*
>> > Product Manager & Lead Developer
>> >
>> > 251.605.3071 | john@curvolabs.com
>> > www.curvolabs.com
>> >
>> > 58 Adams Ave
>> > Evansville, IN 47713
>> >
>> > On Mon, May 18, 2015 at 2:34 PM, Doug Turnbull <
>> > dturnbull@opensourceconnections.com> wrote:
>> >
>> > > You might just need some syntax help. Not sure what the Solr admin
>> > escapes,
>> > > but many of the text in your query actually have reserved meaning.
>> Also,
>> > > when a term appears without a fieldName:value directly in front of
>> it, I
>> > > believe its going to search the default field (it's no longer
>> attached to
>> > > the field). You need to use parens to attach multiple terms to that
>> field
>> > > for search.
>> > >
>> > > I'd try to see if doing any of the following help:
>> > >
>> > > Add parens to group terms to the field:
>> > >
>> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream
>> > 1.5pt)
>> > > +
>> > > productnumber:(001-029-1298)
>> > >
>> > > Also keep in mind "+" means mandatory, and its an operator on just one
>> > > field. So in the above you're requiring description and product number
>> > > match the provided terms.
>> > >
>> > > Further, you may need to escape the "-" as that means "NOT". You can
>> do
>> > > that with the following:
>> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream
>> > 1.5pt)
>> > > +
>> > > productnumber:(001\-029\-1298)
>> > >
>> > > You can read more in the article on Solr query syntax
>> > > https://wiki.apache.org/solr/SolrQuerySyntax
>> > >
>> > > Hope that helps, for all I know your cut and paste didn't work and I'm
>> > > assuming you have syntax issues :)
>> > >
>> > > -Doug
>> > >
>> > > On Mon, May 18, 2015 at 2:25 PM, John Blythe <jo...@curvolabs.com>
>> wrote:
>> > >
>> > > > Hey Doug,
>> > > >
>> > > > Thanks for the quick reply.
>> > > >
>> > > > No edismax just yet. Planning on getting there, but have been
>> trying to
>> > > > fine tune the 3 primary fields we use over the last week or so
>> before
>> > > > jumping into edismax and its nifty toolset to help push our accuracy
>> > and
>> > > > precision even further (aside: is this a good strategy?)
>> > > >
>> > > > For now I'm querying directly in the admin interface, doing
>> something
>> > > like
>> > > > this:
>> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream
>> > > 1.5pt +
>> > > > productnumber: 001-029-1298
>> > > >
>> > > > versus
>> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream
>> > 1.5pt
>> > > >
>> > > > Another interesting and likely related factor is the description's
>> lack
>> > > of
>> > > > help. With the product number in place it gets nailed even with
>> stray
>> > > > zeros, 4's instead of 1's, etc.
>> > > >
>> > > > Without it, though, the querying just flat out sucks. For instance,
>> I
>> > > just
>> > > > saw something akin to this:
>> > > > mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt
>> > > >
>> > > > that got nowhere near what it should have. Straw would have a
>> synonym
>> > to
>> > > > map to strawberry and would match the document's description
>> *exactly,
>> > > *yet
>> > > > Solr would push out all sorts of peripheral suggestions that didn't
>> > match
>> > > > strawberry or was a different amount (.75pt, for instance). I know
>> I'm
>> > no
>> > > > expert, but I was thinking my analyzer was a bit better than that :p
>> > > >
>> > > > --
>> > > > *John Blythe*
>> > > > Product Manager & Lead Developer
>> > > >
>> > > > 251.605.3071 | john@curvolabs.com
>> > > > www.curvolabs.com
>> > > >
>> > > > 58 Adams Ave
>> > > > Evansville, IN 47713
>> > > >
>> > > > On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull <
>> > > > dturnbull@opensourceconnections.com> wrote:
>> > > >
>> > > > > > The maxScore is 772 when I remove the
>> > > > > description.
>> > > > > > I suppose the actual question, then, is if a low relevancy
>> score on
>> > > one
>> > > > > field
>> > > > > hurts the rest of them / the cumulative score,
>> > > > >
>> > > > > This depends a lot on how you're searching over these fields. Is
>> > this a
>> > > > > (e)dismax query? Or a lucene query? Something else?
>> > > > >
>> > > > > Across fields there's query normalization, which attempts to take
>> a
>> > sum
>> > > > of
>> > > > > squares of IDFs of the search terms across the fields being
>> searched.
>> > > > > Adding/removing a field could impact query normalization.
>> > > > >
>> > > > > By removing a field, you also likely remove a boolean clause. By
>> > > removing
>> > > > > the clause, there's less of a chance the coordinating factor
>> (known
>> > as
>> > > > > coord) would punish your relevancy score.
>> > > > >
>> > > > > Otherwise, don't know -- perhaps you could give us more
>> information
>> > on
>> > > > how
>> > > > > you're searching your documents? Perhaps a sample Solr URL that
>> shows
>> > > how
>> > > > > you're querying?
>> > > > >
>> > > > > Cheers,
>> > > > > --
>> > > > > *Doug Turnbull **| *Search Relevance Consultant | OpenSource
>> > > Connections,
>> > > > > LLC | 240.476.9983 | http://www.opensourceconnections.com
>> > > > > Author: Relevant Search <http://manning.com/turnbull> from
>> Manning
>> > > > > Publications
>> > > > > This e-mail and all contents, including attachments, is
>> considered to
>> > > be
>> > > > > Company Confidential unless explicitly stated otherwise,
>> regardless
>> > > > > of whether attachments are marked as such.
>> > > > > On Mon, May 18, 2015 at 1:57 PM, John Blythe <jo...@curvolabs.com>
>> > > wrote:
>> > > > >
>> > > > > > Background:
>> > > > > > I'm using Solr as a mechanism for search for users, but before
>> even
>> > > > > getting
>> > > > > > to that point as a means of intelligent inference more or less.
>> > > Product
>> > > > > > data comes in and we're hoping to match it to the correct known
>> > > product
>> > > > > > without having to use the user for confirmation/search.
>> > > > > >
>> > > > > > Problem:
>> > > > > > I get a maxScore (with the correct result at the top) of
>> 618.22626
>> > > > using
>> > > > > > the manufacturer's name, the product number, and the product
>> > > > description.
>> > > > > > All of these items are coming from a previous purchaser so we
>> have
>> > to
>> > > > > > account for manufacturer name variations, miskeying of product
>> > > numbers,
>> > > > > and
>> > > > > > variances of descriptions. The maxScore is 772 when I remove the
>> > > > > > description.
>> > > > > >
>> > > > > > My initial question is regarding relevancy scoring (
>> > > > > > https://wiki.apache.org/solr/SolrRelevancyFAQ). I get that
>> many of
>> > > the
>> > > > > > description's tokens will be found throughout the other
>> documents,
>> > > thus
>> > > > > > keeping the relevancy at bay per the IDF portion of the
>> relevancy
>> > > > score.
>> > > > > I
>> > > > > > suppose the actual question, then, is if a low relevancy score
>> on
>> > one
>> > > > > field
>> > > > > > hurts the rest of them / the cumulative score, or if it simply
>> keep
>> > > > that
>> > > > > > field's contribution lower than it'd otherwise be. I thought it
>> was
>> > > the
>> > > > > > latter, but the results I mention above are making me think that
>> > the
>> > > > > first
>> > > > > > scenario is actually the case.
>> > > > > >
>> > > > > > Based on what I hear about the above, a follow up question may
>> be
>> > > what
>> > > > in
>> > > > > > the world is wrong with my analyzer :)
>> > > > > >
>> > > > > > Thanks for any thoughts!
>> > > > > >
>> > > > > > Best,
>> > > > > > John
>> > > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > *Doug Turnbull **| *Search Relevance Consultant | OpenSource
>> Connections,
>> > > LLC | 240.476.9983 | http://www.opensourceconnections.com
>> > > Author: Relevant Search <http://manning.com/turnbull> from Manning
>> > > Publications
>> > > This e-mail and all contents, including attachments, is considered to
>> be
>> > > Company Confidential unless explicitly stated otherwise, regardless
>> > > of whether attachments are marked as such.
>> > >
>> >
>>
>>
>>
>> --
>> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections,
>> LLC | 240.476.9983 | http://www.opensourceconnections.com
>> Author: Relevant Search <http://manning.com/turnbull> from Manning
>> Publications
>> This e-mail and all contents, including attachments, is considered to be
>> Company Confidential unless explicitly stated otherwise, regardless
>> of whether attachments are marked as such.
>>
>
>

Re: Relevancy Scoring

Posted by John Blythe <jo...@curvolabs.com>.

Doug,

A couple of things quickly:
- I'll look into that. How would you go about testing things, with a direct
URL? If so, how would you compose one of the examples above?
- Yup, I used the analysis tool extensively before testing scores to make
sure things were being parsed appropriately (splitting off the unit of
measure [mm] while keeping the decimal intact instead of breaking it up has
been my biggest concern lately).
- To that point, though, it looks like one of my blunders was in the
synonyms file. I just checked /analysis/ again and realized "CANN" was
being mapped to "cannula" instead of "cannulated" #facepalm
- I'll be GLAD to use that! I'd been trying to use http://explain.solr.pl/
previously, but it kept erroring out on me :\
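
On the direct-URL question in the first bullet, one way to compose such a request is sketched below in Python. This is a rough sketch, not a recommendation from the thread. The host, port and core name ("products") are hypothetical. The q, fl, rows, debugQuery and wt parameters are standard Solr request parameters.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build a /select URL that also asks Solr for score explanations."""
    params = {
        "q": query,            # the raw query string, URL-encoded below
        "fl": "*,score",       # return all stored fields plus the score
        "rows": rows,
        "debugQuery": "true",  # include the parsed query and explain output
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


url = solr_select_url(
    "http://localhost:8983/solr/products",  # hypothetical host and core
    "descript2:(CANN SCREW PT 3.5X50MM)",
)
print(url)
```

Letting urlencode handle the encoding avoids hand-escaping spaces, colons and parentheses. The resulting URL can be opened in a browser or pasted into an explain viewer.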

Thanks again, will report back!



Re: Relevancy Scoring

Posted by Doug Turnbull <dt...@opensourceconnections.com>.

Hey John,

I think you likely do need to think about escaping the query operators. I
doubt the Solr admin could tell the difference.
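
As a rough illustration of that escaping, here is a small Python helper that backslash-escapes the characters the Lucene/Solr query syntax reserves. The function name is made up for this sketch. It is similar in spirit to SolrJ's ClientUtils.escapeQueryChars, but it is not a copy of it (for one thing, it leaves whitespace alone).

```python
# Characters with reserved meaning in the Lucene/Solr query syntax.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_query_term(text):
    """Backslash-escape every reserved character in a user-supplied value."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


# Escape the value only, then attach it to the field.
product_number = escape_query_term("001-029-1298")
print(f"productnumber:({product_number})")
```

Only the value is escaped, not the whole query, so the field name, colon and grouping parentheses keep their meaning.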

For analysis, have you looked at the handy analysis tool in the Solr Admin
UI? It's pretty indispensable for figuring out if an analyzed query matches
an analyzed field.

Outside of that, I can selfishly plug Splainer (http://splainer.io), which
gives you more insight into Solr's relevance explain output. You would paste
in something like
http://solr.quepid.com/solr/statedecoded/select?q=text:(deer%20hunting).

Cheers!
-Doug





Re: Relevancy Scoring

Posted by John Blythe <jo...@curvolabs.com>.

Thanks again for the speediness, Doug.

Good to know on some of those things, not least of all the + indicating a
mandatory field and the parentheses. It seems like the escaping is pretty
robust in light of the product number.

I'm thinking it has to be largely related to the analyzer. Check this out,
this time with more of a real world case for us. Searching for "descript2:
CANN SCREW PT 3.5X50MM" produces a top result that has "Cannulated screw PT
4.0x40mm" as its description. There is a document, though, that has the
description "Cannulated screw PT 3.5x50mm", which is exactly what the
analyzer produces for the query (apart from lowercasing), per the /analysis
page. Why would 4.0x40 come up first? The top four results have
4.0x[Something]. It's not till the fifth result that you see a 3.5
something: "Cannulated screw PT 3.5x105mm" at which point I'm saying WTF.
So close, but then it ignores the "50" for a "105" instead.

Further, adding parentheses around the phrase, "descript2: (CANN SCREW PT
3.5X50MM)", produces top results that have the correct dimensions (3.5x50) but
the wrong type. Instead of "cannulated" screws we see "cortical." I'm
convinced Solr is trolling me at this point :p
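
One way to see why a 4.0x40mm document can outrank the 3.5x50mm one is to ask Solr for its score explanation. The following is a minimal sketch that only builds the request URL. The host, port and core name (`localhost:8983`, `products`) are placeholders that do not come from this thread; `q`, `fl`, `rows`, `wt` and `debugQuery` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical base URL: replace host, port and core name with your own.
SOLR_SELECT = "http://localhost:8983/solr/products/select"

def debug_url(query, rows=5):
    """Build a /select URL that asks Solr to explain how each hit was scored."""
    params = {
        "q": query,            # the query exactly as typed in the admin UI
        "fl": "*,score",       # return stored fields plus the relevancy score
        "rows": rows,
        "wt": "json",
        "debugQuery": "true",  # adds parsedquery and per-document explain output
    }
    return SOLR_SELECT + "?" + urlencode(params)

url = debug_url("descript2:(CANN SCREW PT 3.5X50MM)")
print(url)
```

In the response, the `parsedquery` entry of the debug section shows which field each term was actually searched in, and the `explain` entries break each document's score into its individual factors.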

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | john@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, May 18, 2015 at 2:34 PM, Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> You might just need some syntax help. Not sure what the Solr admin escapes,
> but many of the text in your query actually have reserved meaning. Also,
> when a term appears without a fieldName:value directly in front of it, I
> believe its going to search the default field (it's no longer attached to
> the field). You need to use parens to attach multiple terms to that field
> for search.
>
> I'd try to see if doing any of the following help:
>
> Add parens to group terms to the field:
>
> mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt)
> +
> productnumber:(001-029-1298)
>
> Also keep in mind "+" means mandatory, and its an operator on just one
> field. So in the above you're requiring description and product number
> match the provided terms.
>
> Further, you may need to escape the "-" as that means "NOT". You can do
> that with the following:
> mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt)
> +
> productnumber:(001\-029\-1298)
>
> You can read more in the article on Solr query syntax
> https://wiki.apache.org/solr/SolrQuerySyntax
>
> Hope that helps, for all I know your cut and paste didn't work and I'm
> assuming you have syntax issues :)
>
> -Doug

Re: Relevancy Scoring

Posted by Doug Turnbull <dt...@opensourceconnections.com>.

You might just need some syntax help. Not sure what the Solr admin escapes,
but much of the text in your query actually has reserved meaning. Also,
when a term appears without a fieldName: prefix directly in front of it, I
believe it's going to search the default field (it's no longer attached to
the field). You need to use parens to attach multiple terms to that field
for search.

I'd try to see if doing any of the following helps:

Add parens to group terms to the field:

mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt)
+productnumber:(001-029-1298)

Also keep in mind "+" means mandatory, and it's an operator on just one
field. So in the above you're requiring description and product number
match the provided terms.

Further, you may need to escape the "-" as that means "NOT". You can do
that with the following:

mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt)
+productnumber:(001\-029\-1298)

You can read more in the article on Solr query syntax
https://wiki.apache.org/solr/SolrQuerySyntax
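
The grouping and escaping advice above can be sketched as a small helper that builds the query string in client code. This is an illustration only: `escape_term` and `clause` are hypothetical names, and the list of reserved characters is an assumption based on the Lucene/Solr classic query syntax (a single `&` or `|` is escaped as well, to be on the safe side).

```python
import re

# Characters with reserved meaning in the Lucene/Solr classic query syntax.
# Assumption: single & and | are escaped too, although only && and || are operators.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')

def escape_term(text):
    """Backslash-escape reserved characters so they are treated as literal text."""
    return _SPECIAL.sub(r"\\\1", text)

def clause(field, text, required=False):
    """Attach every term in `text` to `field` by wrapping the terms in parentheses."""
    prefix = "+" if required else ""
    return f"{prefix}{field}:({escape_term(text)})"

query = " ".join([
    clause("mfgname2", "Ben & Jerry's"),
    clause("descript1", "Strawberry Shortcake Ice Cream 1.5pt", required=True),
    clause("productnumber", "001-029-1298", required=True),
])
print(query)
```

Whitespace is deliberately left unescaped so that the terms inside the parentheses are still analyzed as separate words of the same field.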

Hope that helps, for all I know your cut and paste didn't work and I'm
assuming you have syntax issues :)

-Doug

On Mon, May 18, 2015 at 2:25 PM, John Blythe <jo...@curvolabs.com> wrote:

> Hey Doug,
>
> Thanks for the quick reply.
>
> No edismax just yet. Planning on getting there, but have been trying to
> fine tune the 3 primary fields we use over the last week or so before
> jumping into edismax and its nifty toolset to help push our accuracy and
> precision even further (aside: is this a good strategy?)
>
> For now I'm querying directly in the admin interface, doing something like
> this:
> mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt +
> productnumber: 001-029-1298
>
> versus
> mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt
>
> Another interesting and likely related factor is the description's lack of
> help. With the product number in place it gets nailed even with stray
> zeros, 4's instead of 1's, etc.
>
> Without it, though, the querying just flat out sucks. For instance, I just
> saw something akin to this:
> mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt
>
> that got nowhere near what it should have. Straw would have a synonym to
> map to strawberry and would match the document's description *exactly, *yet
> Solr would push out all sorts of peripheral suggestions that didn't match
> strawberry or was a different amount (.75pt, for instance). I know I'm no
> expert, but I was thinking my analyzer was a bit better than that :p
>
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | john@curvolabs.com
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713



-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Relevant Search <http://manning.com/turnbull> from Manning
Publications
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.

Re: Relevancy Scoring

Posted by John Blythe <jo...@curvolabs.com>.

Hey Doug,

Thanks for the quick reply.

No edismax just yet. Planning on getting there, but have been trying to
fine tune the 3 primary fields we use over the last week or so before
jumping into edismax and its nifty toolset to help push our accuracy and
precision even further (aside: is this a good strategy?)

For now I'm querying directly in the admin interface, doing something like
this:
mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt +
productnumber: 001-029-1298

versus
mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt

Another interesting and likely related factor is the description's lack of
help. With the product number in place it gets nailed even with stray
zeros, 4's instead of 1's, etc.

Without it, though, the querying just flat out sucks. For instance, I just
saw something akin to this:
mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt

that got nowhere near what it should have. Straw would have a synonym to
map to strawberry and would match the document's description *exactly*, yet
Solr would push out all sorts of peripheral suggestions that didn't match
strawberry or were a different amount (.75pt, for instance). I know I'm no
expert, but I was thinking my analyzer was a bit better than that :p

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | john@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> > The maxScore is 772 when I remove the description.
> > I suppose the actual question, then, is if a low relevancy score on one
> > field hurts the rest of them / the cumulative score,
>
> This depends a lot on how you're searching over these fields. Is this a
> (e)dismax query? Or a lucene query? Something else?
>
> Across fields there's query normalization, which attempts to take a sum of
> squares of IDFs of the search terms across the fields being searched.
> Adding/removing a field could impact query normalization.
>
> By removing a field, you also likely remove a boolean clause. By removing
> the clause, there's less of a chance the coordinating factor (known as
> coord) would punish your relevancy score.
>
> Otherwise, don't know -- perhaps you could give us more information on how
> you're searching your documents? Perhaps a sample Solr URL that shows how
> you're querying?
>
> Cheers,
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections,
> LLC | 240.476.9983 | http://www.opensourceconnections.com
> Author: Relevant Search <http://manning.com/turnbull> from Manning
> Publications
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.

Re: Relevancy Scoring

Posted by Doug Turnbull <dt...@opensourceconnections.com>.

Also, I wouldn't expect at all that query-to-query you'll get comparable
scores. I'm not at all surprised that suddenly you get big swings in
scoring. So many parts of the scoring equation can change query to query.

On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> > The maxScore is 772 when I remove the description.
> > I suppose the actual question, then, is if a low relevancy score on one
> > field hurts the rest of them / the cumulative score,
>
> This depends a lot on how you're searching over these fields. Is this a
> (e)dismax query? Or a lucene query? Something else?
>
> Across fields there's query normalization, which attempts to take a sum of
> squares of IDFs of the search terms across the fields being searched.
> Adding/removing a field could impact query normalization.
>
> By removing a field, you also likely remove a boolean clause. By removing
> the clause, there's less of a chance the coordinating factor (known as
> coord) would punish your relevancy score.
>
> Otherwise, don't know -- perhaps you could give us more information on how
> you're searching your documents? Perhaps a sample Solr URL that shows how
> you're querying?
>
> Cheers,
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections,
> LLC | 240.476.9983 | http://www.opensourceconnections.com
> Author: Relevant Search <http://manning.com/turnbull> from Manning
> Publications
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.


-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Relevant Search <http://manning.com/turnbull> from Manning
Publications
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.

Re: Relevancy Scoring

Posted by Doug Turnbull <dt...@opensourceconnections.com>.

> The maxScore is 772 when I remove the description.
> I suppose the actual question, then, is if a low relevancy score on one
> field hurts the rest of them / the cumulative score,

This depends a lot on how you're searching over these fields. Is this a
(e)dismax query? Or a lucene query? Something else?

Across fields there's query normalization, which attempts to take a sum of
squares of IDFs of the search terms across the fields being searched.
Adding/removing a field could impact query normalization.

By removing a field, you also likely remove a boolean clause. By removing
the clause, there's less of a chance the coordinating factor (known as
coord) would punish your relevancy score.
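
To make the two effects above concrete, here is a simplified sketch of query normalization and coord as they work in the classic Lucene TF-IDF similarity (the default in Solr releases of this era, as far as I know). All document frequencies and the index size are made-up numbers, query boosts are assumed to be 1, and the function names are illustrative rather than any Solr API.

```python
import math

def idf(doc_freq, num_docs):
    """Classic Lucene inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def query_norm(term_idfs):
    """1 / sqrt(sum of squared term weights); boosts are assumed to be 1."""
    return 1.0 / math.sqrt(sum(w * w for w in term_idfs))

def coord(matched_clauses, total_clauses):
    """Fraction of the query's clauses that a document matched."""
    return matched_clauses / total_clauses

NUM_DOCS = 100_000  # made-up index size

# Made-up document frequencies: product numbers are rare, description words common.
two_fields = [idf(12, NUM_DOCS), idf(3, NUM_DOCS)]    # mfgname + productnumber
three_fields = two_fields + [idf(40_000, NUM_DOCS)]   # ... + description

print("queryNorm, 2 clauses:", query_norm(two_fields))
print("queryNorm, 3 clauses:", query_norm(three_fields))
print("coord when 2 of 3 clauses match:", coord(2, 3))
```

Adding a clause increases the sum of squares, so queryNorm shrinks and scales down every clause's contribution. queryNorm is the same for every document within one query, so it does not change the ranking, but it does change the absolute scores from one query to the next.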

Otherwise, don't know -- perhaps you could give us more information on how
you're searching your documents? Perhaps a sample Solr URL that shows how
you're querying?

Cheers,
-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Relevant Search <http://manning.com/turnbull> from Manning
Publications
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.
On Mon, May 18, 2015 at 1:57 PM, John Blythe <jo...@curvolabs.com> wrote:

> Background:
> I'm using Solr as a mechanism for search for users, but before even getting
> to that point as a means of intelligent inference more or less. Product
> data comes in and we're hoping to match it to the correct known product
> without having to use the user for confirmation/search.
>
> Problem:
> I get a maxScore (with the correct result at the top) of 618.22626 using
> the manufacturer's name, the product number, and the product description.
> All of these items are coming from a previous purchaser so we have to
> account for manufacturer name variations, miskeying of product numbers, and
> variances of descriptions. The maxScore is 772 when I remove the
> description.
>
> My initial question is regarding relevancy scoring (
> https://wiki.apache.org/solr/SolrRelevancyFAQ). I get that many of the
> description's tokens will be found throughout the other documents, thus
> keeping the relevancy at bay per the IDF portion of the relevancy score. I
> suppose the actual question, then, is if a low relevancy score on one field
> hurts the rest of them / the cumulative score, or if it simply keep that
> field's contribution lower than it'd otherwise be. I thought it was the
> latter, but the results I mention above are making me think that the first
> scenario is actually the case.
>
> Based on what I hear about the above, a follow up question may be what in
> the world is wrong with my analyzer :)
>
> Thanks for any thoughts!
>
> Best,
> John
>