Posted to solr-user@lucene.apache.org by John Bickerstaff <jo...@johnbickerstaff.com> on 2016/08/11 18:20:51 UTC

Want zero results from SOLR when there are no matches for "querystring"

First let me say that this is very possibly the "x - y problem" so let me
state up front what my ultimate need is -- then I'll ask about the thing I
imagine might help...  which, of course, is heavily biased in the direction
of my experience coding Java and writing SQL...

I have a piece of a query that calculates a score based on a "weighting"
number stored in each solr doc.  I'm including the xml for my custom
endpoint below...

The specific line is this:
<str name="bf">product(field(category_weight),20)</str>

What I just realized is that when I query Solr for a string that has NO
matches in the entire corpus, I still get a slew of results because EVERY
doc has the weighting value in the category_weight field - and therefore
every doc gets some score.

What I would like is to return zero results if there is no match for the
querystring.  My collection is small enough that I don't care if the actual
calculation runs on each doc (although that's wasteful) -- I just don't
want to see results come back for zero matches to the querystring.

(The /select endpoint does this of course, but my custom endpoint includes
this "weighting" piece and therefore returns every doc in the corpus
because they all have the weighting.)

====================
Enter my imagined solution...  The potential X-Y problem...
====================

So - given that I come from a programming background, I immediately start
thinking of an if statement ...

     if(some_score_for_the_primary_search_string) {
          run_the_category_weight_calculation;
     } else {
          do_NOT_run_category_weight_calc;
     }


Another way of thinking of it would be something like the "WHERE" clause in
SQL...

 run_category_weight_calculation WHERE "searchstring" is found in the
document, not otherwise.
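
To make the intent concrete, here is a toy Python model of the "only score
the weighting when the search string matched" idea. It is not Solr code: the
function name, the document ids and the scores are all made up for
illustration, and the factor of 20 is simply carried over from the bf line.

```python
from typing import Dict, Optional, Tuple


def gated_score(text_score: float, category_weight: float,
                factor: float = 20.0) -> Optional[float]:
    """Toy model: a document is only a hit if the text query matched it."""
    if text_score <= 0.0:
        return None  # no match for the search string: do not return the doc
    # the weighting is only added for documents that already matched
    return text_score + category_weight * factor


# hypothetical documents: id -> (text_score, category_weight)
docs: Dict[str, Tuple[float, float]] = {"a": (0.0, 1.5), "b": (2.0, 1.5)}

hits = {}
for doc_id, (text_score, weight) in docs.items():
    score = gated_score(text_score, weight)
    if score is not None:
        hits[doc_id] = score

print(hits)
```

With these made-up numbers only doc "b" comes back; doc "a" has the
weighting but no text match, so it is dropped.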

I'm aware that things could be handled in the client-side of my web app,
but if possible, I'd like the interface to SOLR to be as clean as possible,
and massage incoming SOLR data as little as possible.

In other words, do NOT return any docs if the querystring (and any
synonyms) match zero docs.

Here is the endpoint XML for the query.  I've highlighted the specific line
that is causing the unintended results...


 <requestHandler name="/foo" class="solr.SearchHandler">
    <!-- default values for query parameters can be specified, these
         will be overridden by parameters in the request
      -->
     <lst name="defaults">
       <str name="echoParams">all</str>
       <int name="rows">20</int>
       <!-- Query settings -->
       <str name="df">text</str>
      <!-- <str name="df">title</str> -->
       <str name="defType">synonym_edismax</str>
       <str name="synonyms">true</str>
    <!-- The line below balances out the weighting of exact matches to the
         synonym phrase entered by the user with the category_weight
         calculation and the titleQuery calc.  These numbers exist in a
         balance and if one is raised or lowered, the others (probably) need
         to change as well.  It may be better to go with decimals for all of
         them... .4 instead of 4 and 2 instead of 20 and 2.5 instead of 25.
         In the end, I'm not sure it really matters, but don't change one
         without changing the others unless you've tested and are sure you
         want the results  -->
       <float name="synonyms.originalBoost">1.5</float>
       <float name="synonyms.synonymBoost">1.1</float>
       <str name="mm">75%</str>
       <str name="q.alt">*:*</str>
       <str name="rows">20</str>
       <str name="fq">meta_doc_type:chapterDoc</str>
       <str name="bq">{!synonym_edismax qf='title' synonyms='true'
synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
v=$q}</str>
       <str name="fl">id category_weight title category_ss score
contentType</str>
       <str name="titleQuery">{!edismax qf='title' bf='' bq='' v=$q}</str>
=====================================================
       *<str name="bf">product(field(category_weight),20)</str>*
=====================================================
       <str name="bf">product(query($titleQuery),4)</str>
       <str name="qf">text contentType^1000</str>
       <str name="wt">python</str>
       <str name="debug">true</str>
       <str name="debug.explain.structured">true</str>
       <str name="indent">true</str>
       <str name="echoParams">all</str>
     </lst>
  </requestHandler>

And here is the debug output for a query.  (This was a test for synonyms,
which you'll see in the output.) The original query string was, of
course, "μ-heavy chain disease"

You'll note that although there is no score in the first doc explain for
the actual querystring, the highlighted section does get a score for
product(double(category_weight)=1.5,const(20))

... which is the thing that is currently causing all the docs in the
collection to "match" even though the querystring is not in any of them.
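
For anyone following the numbers in the explain output, the 30.0 is just the
function value times the two factors listed next to it. A small Python check,
using only the values shown in the debug output (the variable names are
mine):

```python
category_weight = 1.5  # double(category_weight)=1.5 in the explain output
constant = 20.0        # const(20)
boost = 1.0            # "description":"boost"
query_norm = 1.0       # "description":"queryNorm"

# product(double(category_weight), const(20)) * boost * queryNorm
function_score = category_weight * constant * boost * query_norm
print(function_score)  # 30.0
```

So the whole score for that doc comes from the weighting clause; nothing in
the visible part of the explain comes from the query string itself.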

"debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
"querystring":"\"μ-heavy
chain disease\"", "parsedquery":"(DisjunctionMaxQuery((text:\"μ heavy chain
disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" | (contentType:\"mu
heavy chain disease\")^1000.0)))/no_coord^1.1)
((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ heavy chain
disease\" | (contentType:\"μ heavy chain disease\")^1000.0)))/no_coord^1.1)
((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ heavy
chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy chain
disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy chain
disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
hcd\")))/no_coord^1.1)))
FunctionQuery(product(double(category_weight),const(20)))
FunctionQuery(product(query(+(title:\"μ heavy chain
disease\"),def=0.0),const(4)))", "parsedquery_toString":"(((text:\"μ heavy
chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain
disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" | (contentType:\"μ
heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)
((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)))
product(double(category_weight),const(20)) product(query(+(title:\"μ heavy
chain disease\"),def=0.0),const(4))", "explain":{ "
33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true, "value":30.0, "
description":"sum of:", "details":[{ "match":true, "value":30.0, "
description":"FunctionQuery(product(double(category_weight),const(20))),
product of:",
=====================================================
*"details":**[{ "match":true, "value":30.0,
"description":"product(double(category_weight)=1.5,const(20))"}, {*
=====================================================

"match":true, "value":1.0, "description":"boost"}, { "match":true, "value":
1.0, "description":"queryNorm"}]}, {

Re: Want zero results from SOLR when there are no matches for "querystring"

Posted by John Bickerstaff <jo...@johnbickerstaff.com>.
Thanks - I'll look at it...

On Fri, Aug 12, 2016 at 1:21 PM, Erick Erickson <er...@gmail.com>
wrote:

> Maybe rerankqparserplugin?
>
> On Aug 12, 2016 11:54, "John Bickerstaff" <jo...@johnbickerstaff.com>
> wrote:
>
> > @Hossman --  thanks again.
> >
> > I've made the following change and so far things look good.  I couldn't
> > see debug or find results for what I put in for $func, so I just removed
> > it, but making modifications as you suggested appears to be working.
> >
> > Including the actual line from my endpoint XML in case this thread helps
> > someone else...
> >
> > <str name="q">{!boost defType=synonym_edismax qf='title' synonyms='true'
> > synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
> > v=$q}</str>
> >
> > On Fri, Aug 12, 2016 at 12:09 PM, John Bickerstaff <
> > john@johnbickerstaff.com
> > > wrote:
> >
> > > Thanks!  I'll check it out.
> > >
> > > On Fri, Aug 12, 2016 at 12:05 PM, Susheel Kumar <susheel2777@gmail.com>
> > > wrote:
> > >
> > >> Not exactly sure what you are looking for from chaining the results, but
> > >> similar functionality is available in Streaming Expressions, where the
> > >> results of inner expressions are passed to outer expressions and so on:
> > >> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > >>
> > >> HTH
> > >> Susheel
> > >>
> > >> On Fri, Aug 12, 2016 at 1:08 PM, John Bickerstaff <
> > >> john@johnbickerstaff.com>
> > >> wrote:
> > >>
> > >> > Hossman - many thanks again for your comprehensive and very helpful
> > >> > answer!
> > >> >
> > >> > All,
> > >> >
> > >> > I am (possibly mis-remembering) reading something about being able to
> > >> > pass the results of one query to another query...  Essentially
> > >> > "chaining" result sets.
> > >> >
> > >> > I have looked in docs and can't find anything on a quick search -- I
> > >> > may have been reading about the Re-Ranking feature, which doesn't help
> > >> > me (I know because I just tried and it seems to return all results
> > >> > anyway, just re-ranking the number specified in the reRankDocs
> > >> > flag...)
> > >> >
> > >> > Is there a way to (cleanly) send the results of one query to another
> > >> > query for further processing?  Essentially, pass ONLY the results
> > >> > (including an empty set of results) to another query for processing?
> > >> >
> > >> > thanks...
> > >> >
> > >> > On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <
> > >> > john@johnbickerstaff.com>
> > >> > wrote:
> > >> >
> > >> > > Thanks!
> > >> > >
> > >> > > To answer your questions, while I digest the rest of that
> > >> > > information...
> > >> > >
> > >> > > I'm using the hon-lucene-synonyms.5.0.4.jar from here:
> > >> > > https://github.com/healthonnet/hon-lucene-synonyms
> > >> > >
> > >> > > The config looks like this - and IIRC, is simply a copy from the
> > >> > > recommended config on the site mentioned above.
> > >> > >
> > >> > >  <queryParser name="synonym_edismax"
> > >> > >      class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
> > >> > >     <!-- You can define more than one synonym analyzer in the
> > >> > >          following list.
> > >> > >          For example, you might have one set of synonyms for English,
> > >> > >          one for French, one for Spanish, etc.
> > >> > >       -->
> > >> > >     <lst name="synonymAnalyzers">
> > >> > >       <!-- Name your analyzer something useful, e.g. "analyzer_en",
> > >> > >            "analyzer_fr", "analyzer_es", etc.
> > >> > >            If you only have one, the name doesn't matter (hence
> > >> > >            "myCoolAnalyzer").
> > >> > >         -->
> > >> > >       <lst name="myCoolAnalyzer">
> > >> > >         <!-- We recommend a PatternTokenizerFactory that tokenizes
> > >> > >              based on whitespace and quotes.
> > >> > >              This seems to work best with most people's synonym files.
> > >> > >              For details, read the discussion here:
> > >> > >              http://github.com/healthonnet/hon-lucene-synonyms/issues/26
> > >> > >           -->
> > >> > >         <lst name="tokenizer">
> > >> > >           <str name="class">solr.PatternTokenizerFactory</str>
> > >> > >           <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
> > >> > >         </lst>
> > >> > >         <!-- The ShingleFilterFactory outputs synonyms of multiple
> > >> > >              token lengths (e.g. unigrams, bigrams, trigrams, etc.).
> > >> > >              The default here is to assume you don't have any synonyms
> > >> > >              longer than 4 tokens.
> > >> > >              You can tweak this depending on what your synonyms look
> > >> > >              like.  E.g. if you only have unigrams, you can remove
> > >> > >              it entirely, and if your synonyms are up to 7 tokens in
> > >> > >              length, you should set the maxShingleSize to 7.
> > >> > >           -->
> > >> > >         <lst name="filter">
> > >> > >           <str name="class">solr.ShingleFilterFactory</str>
> > >> > >           <str name="outputUnigramsIfNoShingles">true</str>
> > >> > >           <str name="outputUnigrams">true</str>
> > >> > >           <str name="minShingleSize">2</str>
> > >> > >           <str name="maxShingleSize">4</str>
> > >> > >         </lst>
> > >> > >         <!-- This is where you set your synonym file.  For the unit
> > >> > >              tests and "Getting Started" examples, we use
> > >> > >              example_synonym_file.txt.
> > >> > >              This plugin will work best if you keep expand set to true
> > >> > >              and have all your synonyms comma-separated (rather than
> > >> > >              =>-separated).
> > >> > >           -->
> > >> > >         <lst name="filter">
> > >> > >           <str name="class">solr.SynonymFilterFactory</str>
> > >> > >           <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
> > >> > >           <str name="synonyms">example_synonym_file.txt</str>
> > >> > >           <str name="expand">true</str>
> > >> > >           <str name="ignoreCase">true</str>
> > >> > >         </lst>
> > >> > >       </lst>
> > >> > >     </lst>
> > >> > >   </queryParser>
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <
> > >> > hossman_lucene@fucit.org
> > >> > > > wrote:
> > >> > >
> > >> > >>
> > >> > >> : First let me say that this is very possibly the "x - y problem" so
> > >> > >> : let me state up front what my ultimate need is -- then I'll ask
> > >> > >> : about the thing I imagine might help...  which, of course, is
> > >> > >> : heavily biased in the direction of my experience coding Java and
> > >> > >> : writing SQL...
> > >> > >>
> > >> > >> Thank you so much for asking your question this way!
> > >> > >>
> > >> > >> Right off the bat, the background you've provided seems suspicious...
> > >> > >>
> > >> > >> : I have a piece of a query that calculates a score based on a
> > >> > >> : "weighting"
> > >> > >>         ...
> > >> > >> : The specific line is this:
> > >> > >> : <str name="bf">product(field(category_weight),20)</str>
> > >> > >> :
> > >> > >> : What I just realized is that when I query Solr for a string that
> > >> > >> : has NO matches in the entire corpus, I still get a slew of results
> > >> > >> : because EVERY doc has the weighting value in the category_weight
> > >> > >> : field - and therefore every doc gets some score.
> > >> > >>
> > >> > >> ...that is *NOT* how dismax and edismax normally work.
> > >> > >>
> > >> > >> While both the "bf" and "bq" params result in "additive" boosting,
> > >> > >> and the implementation of that "additive boost" comes from adding new
> > >> > >> optional clauses to the top level BooleanQuery that is executed, that
> > >> > >> only happens after the "main" query (from your "q" param) is added to
> > >> > >> that top level BooleanQuery as a "mandatory" clause.
> > >> > >>
> > >> > >> So, for example, "bf=true()" and "bq=*:*" should match & boost every
> > >> > >> doc, but with the techproducts configs/data these requests still
> > >> > >> don't match anything...
> > >> > >>
> > >> > >> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
> > >> > >> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
> > >> > >>
> > >> > >> ...and if you look at the debug output, the parsed queries show that
> > >> > >> the "bogus" part of the query is mandatory...
> > >> > >>
> > >> > >> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*) FunctionQuery(const(true))
> > >> > >>
> > >> > >> (i didn't use "pf" in that example, but the effect is the same, the
> > >> > >> "pf" based clauses are optional, while the "qf" based clauses are
> > >> > >> mandatory)
> > >> > >>
> > >> > >> If you compare that example to your debug output, you'll notice a
> > >> > >> difference in structure -- it's a bit hard to see in your example,
> > >> > >> but if you simplify your qf, pf, and q fields it should be more
> > >> > >> obvious, but AFAICT the "main" parts of your query are getting
> > >> > >> wrapped in an extra layer of parens (ie: an extra BooleanQuery) which
> > >> > >> is *not* mandatory in the top level query ... i don't see *any*
> > >> > >> mandatory clauses in your top level BooleanQuery, which is why any
> > >> > >> match on a bf or bq function is enough to cause a document to match.
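
As a rough illustration of the structural difference described above, here
is a toy Python model of how a top-level boolean query decides whether a
document matches. It is a sketch, not Lucene code: the `Clause` class and
`boolean_match` function are hypothetical, and the function-query clause is
simply assumed to match every document.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Doc = Dict[str, object]


@dataclass
class Clause:
    matches: Callable[[Doc], bool]
    required: bool  # True ~ a mandatory "+" clause, False ~ an optional clause


def boolean_match(clauses: List[Clause], doc: Doc) -> bool:
    """With mandatory clauses, all of them must match; optional ones only
    add score. With no mandatory clauses, any optional match is enough."""
    required = [c for c in clauses if c.required]
    optional = [c for c in clauses if not c.required]
    if required:
        return all(c.matches(doc) for c in required)
    return any(c.matches(doc) for c in optional)


def text_clause(doc: Doc) -> bool:
    return "bogus" in str(doc["text"])


def weight_clause(doc: Doc) -> bool:
    return True  # assumption: the function query matches every document


doc = {"text": "heavy chain disease", "category_weight": 1.5}

# normal edismax shape: main query mandatory, bf clause optional
edismax_like = [Clause(text_clause, True), Clause(weight_clause, False)]
# shape seen in the debug output: nothing mandatory at the top level
all_optional = [Clause(text_clause, False), Clause(weight_clause, False)]

print(boolean_match(edismax_like, doc))  # False
print(boolean_match(all_optional, doc))  # True
```

The second case is the one that returns the whole corpus: the weighting
clause alone is enough for a match.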
> > >> > >>
> > >> > >> I suspect the reason your parsed query structure is so different has
> > >> > >> to do with this...
> > >> > >>
> > >> > >> :        <str name="defType">synonym_edismax</str>
> > >> > >>
> > >> > >>
> > >> > >> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
> > >> > >> 2) what QParserPlugin are you using to implement that?
> > >> > >>
> > >> > >> I suspect whatever QParserPlugin you are using has a bug in it :)
> > >> > >>
> > >> > >>
> > >> > >> If you can't fix the bug, one possible workaround would be to abandon
> > >> > >> the bf and bq params completely, and instead wrap the query it
> > >> > >> produces in a {!boost} parser with whatever function you want (using
> > >> > >> functions like sum() or product() to combine multiple functions, and
> > >> > >> query() to incorporate your current bq param).  Doing this will
> > >> > >> require changing how you specify your input (example below) and it
> > >> > >> will result in *multiplicative* boosts -- so your scores will be much
> > >> > >> different, and you will likely have to adjust your constants, but:
> > >> > >> 1) multiplicative boosts are almost always what people *really* want
> > >> > >> anyway; 2) it will ensure the boosts are only applied for things
> > >> > >> matching your main query, no matter how that query parser works or
> > >> > >> what bugs it has.
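
To see why the constants would need adjusting, here is a small Python
comparison of additive and multiplicative boosting for a document that did
match the main query. The function names and the main score of 2.0 are made
up; the weight of 1.5 and the factor of 20 come from earlier in the thread.

```python
def additive_score(main_score: float, weight: float,
                   factor: float = 20.0) -> float:
    """bf-style: the function value is added to the main query score."""
    return main_score + weight * factor


def multiplicative_score(main_score: float, weight: float,
                         factor: float = 20.0) -> float:
    """{!boost}-style: the function value multiplies the main query score."""
    return main_score * (weight * factor)


main_score, weight = 2.0, 1.5  # made-up main score, weight from the thread
print(additive_score(main_score, weight))        # 32.0
print(multiplicative_score(main_score, weight))  # 60.0
```

The same weight and factor give quite different totals, so a factor tuned
for additive boosting will probably not carry over unchanged.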
> > >> > >>
> > >> > >> Example of using {!boost} to wrap an arbitrary other parser...
> > >> > >>
> > >> > >> instead of...
> > >> > >>   defType=foofoo
> > >> > >>   q=barbarbar
> > >> > >>
> > >> > >> use...
> > >> > >>   q={!boost b=$func defType=foofoo v=$qq}
> > >> > >>   qq=barbarbar
> > >> > >>   func=sum(something,somethingelse)
> > >> > >>
> > >> > >> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
> > >> > >> https://cwiki.apache.org/confluence/display/solr/Function+Queries
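
For a client that builds the request itself, the q / qq / func split in the
example above might look like this in Python. This is only a sketch: the
`boost_params` helper is hypothetical, and reusing the old bf expression as
the boost function is an assumption, not something confirmed in this thread.

```python
from urllib.parse import parse_qs, urlencode


def boost_params(user_query: str, parser: str, func: str) -> dict:
    """Build request params that wrap `parser` in a {!boost} query."""
    return {
        # $func and $qq are dereferenced by Solr from the params below
        "q": "{!boost b=$func defType=" + parser + " v=$qq}",
        "qq": user_query,  # what the user actually typed
        "func": func,      # the boost function
    }


params = boost_params(
    "heavy chain disease",
    "synonym_edismax",
    "product(field(category_weight),20)",  # assumption: reuse the old bf value
)
query_string = urlencode(params)  # URL-encoded, ready to append to the path

# round-trip to check that the encoding keeps the values intact
decoded = parse_qs(query_string)
print(decoded["q"][0])
```

Keeping the user's text in its own qq param means the client never has to
escape it into the local-params string.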
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> -Hoss
> > >> > >> http://www.lucidworks.com/
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Want zero results from SOLR when there are no matches for "querystring"

Posted by Erick Erickson <er...@gmail.com>.
Maybe rerankqparserplugin?

On Aug 12, 2016 11:54, "John Bickerstaff" <jo...@johnbickerstaff.com> wrote:

> @Hossman --  thanks again.
>
> I've made the following change and so far things look good.  I couldn't see
> debug or find results for what I put in for $func, so I just removed it,
> but making modifications as you suggested appears to be working.
>
> Including the actual line from my endpoint XML in case this thread helps
> someone else...
>
> <str name="q">{!boost defType=synonym_edismax qf='title' synonyms='true'
> synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
> v=$q}</str>
>
> On Fri, Aug 12, 2016 at 12:09 PM, John Bickerstaff <
> john@johnbickerstaff.com
> > wrote:
>
> > Thanks!  I'll check it out.
> >
> > On Fri, Aug 12, 2016 at 12:05 PM, Susheel Kumar <su...@gmail.com>
> > wrote:
> >
> >> Not exactly sure what you are looking for from chaining the results, but
> >> similar functionality is available in Streaming Expressions, where the
> >> results of inner expressions are passed to outer expressions and so on:
> >> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> >>
> >> HTH
> >> Susheel
> >>
> >> On Fri, Aug 12, 2016 at 1:08 PM, John Bickerstaff <
> >> john@johnbickerstaff.com>
> >> wrote:
> >>
> >> > Hossman - many thanks again for your comprehensive and very helpful
> >> answer!
> >> >
> >> > All,
> >> >
> >> > I am (possibly mis-remembering) reading something about being able to
> >> pass
> >> > the results of one query to another query...  Essentially "chaining"
> >> result
> >> > sets.
> >> >
> >> > I have looked in docs and can't find anything on a quick search -- I
> may
> >> > have been reading about the Re-Ranking feature, which doesn't help me
> (I
> >> > know because I just tried and it seems to return all results anyway,
> >> just
> >> > re-ranking the number specified in the reRankDocs flag...)
> >> >
> >> > Is there a way to (cleanly) send the results of one query to another
> >> query
> >> > for further processing?  Essentially, pass ONLY the results (including
> >> an
> >> > empty set of results) to another query for processing?
> >> >
> >> > thanks...
> >> >
> >> > On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <
> >> > john@johnbickerstaff.com>
> >> > wrote:
> >> >
> >> > > Thanks!
> >> > >
> >> > > To answer your questions, while I digest the rest of that
> >> information...
> >> > >
> >> > > I'm using the hon-lucene-synonyms.5.0.4.jar from here:
> >> > > https://github.com/healthonnet/hon-lucene-synonyms
> >> > >
> >> > > The config looks like this - and IIRC, is simply a copy from the
> >> > > recommended config on the site mentioned above.
> >> > >
> >> > >  <queryParser name="synonym_edismax" class="com.github.healthonnet.
> >> > search.
> >> > > SynonymExpandingExtendedDismaxQParserPlugin">
> >> > >     <!-- You can define more than one synonym analyzer in the
> >> following
> >> > > list.
> >> > >          For example, you might have one set of synonyms for
> English,
> >> one
> >> > > for French,
> >> > >          one for Spanish, etc.
> >> > >       -->
> >> > >     <lst name="synonymAnalyzers">
> >> > >       <!-- Name your analyzer something useful, e.g. "analyzer_en",
> >> > > "analyzer_fr", "analyzer_es", etc.
> >> > >            If you only have one, the name doesn't matter (hence
> >> > > "myCoolAnalyzer").
> >> > >         -->
> >> > >       <lst name="myCoolAnalyzer">
> >> > >         <!-- We recommend a PatternTokenizerFactory that tokenizes
> >> based
> >> > > on whitespace and quotes.
> >> > >              This seems to work best with most people's synonym
> files.
> >> > >              For details, read the discussion here:
> >> > > http://github.com/healthonnet/hon-lucene-synonyms/issues/26
> >> > >           -->
> >> > >         <lst name="tokenizer">
> >> > >           <str name="class">solr.PatternTokenizerFactory</str>
> >> > >           <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
> >> > >         </lst>
> >> > >         <!-- The ShingleFilterFactory outputs synonyms of multiple
> >> token
> >> > > lengths (e.g. unigrams, bigrams, trigrams, etc.).
> >> > >              The default here is to assume you don't have any
> synonyms
> >> > > longer than 4 tokens.
> >> > >              You can tweak this depending on what your synonyms look
> >> > like.
> >> > > E.g. if you only have unigrams, you can remove
> >> > >              it entirely, and if your synonyms are up to 7 tokens in
> >> > > length, you should set the maxShingleSize to 7.
> >> > >           -->
> >> > >         <lst name="filter">
> >> > >           <str name="class">solr.ShingleFilterFactory</str>
> >> > >           <str name="outputUnigramsIfNoShingles">true</str>
> >> > >           <str name="outputUnigrams">true</str>
> >> > >           <str name="minShingleSize">2</str>
> >> > >           <str name="maxShingleSize">4</str>
> >> > >         </lst>
> >> > >         <!-- This is where you set your synonym file.  For the unit
> >> tests
> >> > > and "Getting Started" examples, we use example_synonym_file.txt.
> >> > >              This plugin will work best if you keep expand set to
> true
> >> > and
> >> > > have all your synonyms comma-separated (rather than =>-separated).
> >> > >           -->
> >> > >         <lst name="filter">
> >> > >           <str name="class">solr.SynonymFilterFactory</str>
> >> > >           <str name="tokenizerFactory">solr.
> >> > KeywordTokenizerFactory</str>
> >> > >           <str name="synonyms">example_synonym_file.txt</str>
> >> > >           <str name="expand">true</str>
> >> > >           <str name="ignoreCase">true</str>
> >> > >         </lst>
> >> > >       </lst>
> >> > >     </lst>
> >> > >   </queryParser>
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <
> >> > hossman_lucene@fucit.org
> >> > > > wrote:
> >> > >
> >> > >>
> >> > >> : First let me say that this is very possibly the "x - y problem"
> so
> >> let
> >> > >> me
> >> > >> : state up front what my ultimate need is -- then I'll ask about
> the
> >> > >> thing I
> >> > >> : imagine might help...  which, of course, is heavily biased in the
> >> > >> direction
> >> > >> : of my experience coding Java and writing SQL...
> >> > >>
> >> > >> Thank you so much for asking your question this way!
> >> > >>
> >> > >> Right off the bat, the background you've provided seems
> supicious...
> >> > >>
> >> > >> : I have a piece of a query that calculates a score based on a
> >> > "weighting"
> >> > >>         ...
> >> > >> : The specific line is this:
> >> > >> : <str name="bf">product(field(category_weight),20)</str>
> >> > >> :
> >> > >> : What I just realized is that when I query Solr for a string that
> >> has
> >> > NO
> >> > >> : matches in the entire corpus, I still get a slew of results
> because
> >> > >> EVERY
> >> > >> : doc has the weighting value in the category_weight field - and
> >> > therefore
> >> > >> : every doc gets some score.
> >> > >>
> >> > >> ...that is *NOT* how dismax and edisamx normally work.
> >> > >>
> >> > >> While both the "bf" abd "bq" params result in "additive" boosting,
> >> and
> >> > the
> >> > >> implementation of that "additive boost" comes from adding new
> >> optional
> >> > >> clauses to the top level BooleanQuery that is executed, that only
> >> > happens
> >> > >> after the "main" query (from your "q" param) is added to that top
> >> level
> >> > >> BooleanQuery as a "mandaory" clause.
> >> > >>
> >> > >> So, for example, "bf=true()" and "bq=*:*" should match & boost
> every
> >> > doc,
> >> > >> but with the techprducts configs/data these requests still don't
> >> match
> >> > >> anything...
> >> > >>
> >> > >> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
> >> > >> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
> >> > >>
> >> > >> ...and if you look at the debug output, the parsed queries shows
> that
> >> > the
> >> > >> "bogus" part of the query is mandatory...
> >> > >>
> >> > >> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
> >> > >> FunctionQuery(const(true))
> >> > >>
> >> > >> (i didn't use "pf" in that example, but the effect is the same, the
> >> "pf"
> >> > >> based clauses are optional, while the "qf" based clauses are
> >> mandatory)
> >> > >>
> >> > >> If you compare that example to your debug output, you'll notice a
> >> > >> difference in structure -- it's a bit hard to see in your example,
> >> but
> >> > if
> >> > >> you simplify your qf, pf, and q fields it should be more obvious,
> but
> >> > >> AFAICT the "main" parts of your query are getting wrapped in an
> extra
> >> > >> layer of parents (ie: an extra BooleanQuery) which is *not*
> >> mandatory in
> >> > >> the top level query ... i don't see *any* mandatory clauses in your
> >> top
> >> > >> level BooleanQuery, which is why any match on a bf or bq function
> is
> >> > >> enough to cause a document to match.
> >> > >>
> >> > >> I suspect the reason your parsed query structure is so diff has to
> do
> >> > with
> >> > >> this...
> >> > >>
> >> > >> :        <str name="defType">synonym_edismax</str>>
> >> > >>
> >> > >>
> >> > >> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
> >> > >> 2) what QParserPlugin are you using to implement that?
> >> > >>
> >> > >> I suspect whatever QParserPlugin you are using has a bug in it :)
> >> > >>
> >> > >>
> >> > >> If you can't fix the bug, one possibile workaround would be to
> >> abandon
> >> > bf
> >> > >> and bq params completely, and instead wrap the query it produces in
> >> in a
> >> > >> {!boost} parser with whatever function you want (using functions
> like
> >> > >> sum() or prod() to combine multiple functions, and query() to
> >> > incorporate
> >> > >> your current bq param).  Doing this will require chanign how you
> >> specify
> >> > >> you input (example below) and it will result in *multiplicitive*
> >> boosts
> >> > --
> >> > >> so your scores will be much diff, and you will likely have to
> adjust
> >> > your
> >> > >> constants, but: 1) multiplicitive boosts are almost always what
> >> people
> >> > >> *really* want anyway; 2) it will ensure the boosts are only applied
> >> for
> >> > >> things matching your main query, no matter how that query parser
> >> works
> >> > or
> >> > >> what bugs it has.
> >> > >>
> >> > >> Example of using {!boost} to wrap an arbitrary other parser...
> >> > >>
> >> > >> instead of...
> >> > >>   defType=foofoo
> >> > >>   q=barbarbar
> >> > >>
> >> > >> use...
> >> > >>    q={!boost b=$func defType=foofoo v=$qq}
> >> > >>   qq=barbarbar
> >> > >> func=sum(something,somethingelse)
> >> > >>
> >> > >> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
> >> > >> https://cwiki.apache.org/confluence/display/solr/Function+Queries
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >> :
> >> > >> : What I would like is to return zero results if there is no match
> >> for
> >> > the
> >> > >> : querystring.  My collection is small enough that I don't care if
> >> the
> >> > >> actual
> >> > >> : calculation runs on each doc (although that's wasteful) -- I just
> >> > don't
> >> > >> : want to see results come back for zero matches to the querystring
> >> > >> :
> >> > >> : (The /select endpoint does this of course, but my custom endpoint
> >> > >> includes
> >> > >> : this "weighting" piece and therefore returns every doc in the
> >> corpus
> >> > >> : because they all have the weighting.
> >> > >> :
> >> > >> : ====================
> >> > >> : Enter my imagined solution...  The potential X-Y problem...
> >> > >> : ====================
> >> > >> :
> >> > >> : So - given that I come from a programming background, I
> immediately
> >> > >> start
> >> > >> : thinking of an if statement ...
> >> > >> :
> >> > >> :      if(some_score_for_the_primary_search_string) {
> >> > >> :           run_the_category_weight_calculation;
> >> > >> :      } else {
> >> > >> :           do_NOT_run_category_weight_calc;
> >> > >> :      }
> >> > >> :
> >> > >> :
> >> > >> : Another way of thinking of it would be something like the "WHERE"
> >> > >> clause in
> >> > >> : SQL...
> >> > >> :
> >> > >> :  run_category_weight_calculation WHERE "searchstring" is found
> in
> >> the
> >> > >> : document, not otherwise.
> >> > >> :
> >> > >> : I'm aware that things could be handled in the client-side of my
> web
> >> > app,
> >> > >> : but if possible, I'd like the interface to SOLR to be as clean as
> >> > >> possible,
> >> > >> : and massage incoming SOLR data as little as possible.
> >> > >> :
> >> > >> : In other words, do NOT return any docs if the querystring (and
> any
> >> > >> : synonyms) match zero docs.
> >> > >> :
> >> > >> : Here is the endpoint XML for the query.  I've highlighted the
> >> specific
> >> > >> line
> >> > >> : that is causing the unintended results...
> >> > >> :
> >> > >> :
> >> > >> :  <requestHandler name="/foo" class="solr.SearchHandler">
> >> > >> :     <!-- default values for query parameters can be specified,
> >> these
> >> > >> :          will be overridden by parameters in the request
> >> > >> :       -->
> >> > >> :      <lst name="defaults">
> >> > >> :        <str name="echoParams">all</str>
> >> > >> :        <int name="rows">20</int>
> >> > >> :        <!-- Query settings -->
> >> > >> :        <str name="df">text</str>
> >> > >> :       <!-- <str name="df">title</str> -->
> >> > >> :        <str name="defType">synonym_edismax</str>>
> >> > >> :        <str name="synonyms">true</str>
> >> > >> :     <!-- The line below balances out the weighting of exact
> >> matches to
> >> > >> the
> >> > >> : synonym phrase entered by the user
> >> > >> :          with the category_weight calculation and the titleQuery
> >> calc.
> >> > >> : These numbers exist in a balance and
> >> > >> :          if one is raised or lowered, the others (probably) need
> to
> >> > >> change
> >> > >> : as well.  It may be better to go with decimals
> >> > >> :          for all of them... .4 instead of 4 and 2 instead of 20
> and
> >> > 2.5
> >> > >> : instead of 25.
> >> > >> :          In the end, I'm not sure it really matters, but don't
> >> change
> >> > >> one
> >> > >> : without changing the others
> >> > >> :          unless you've tested and are sure you want the results
> >> -->
> >> > >> :        <float name="synonyms.originalBoost">1.5</float>
> >> > >> :        <float name="synonyms.synonymBoost">1.1</float>
> >> > >> :        <str name="mm">75%</str>
> >> > >> :        <str name="q.alt">*:*</str>
> >> > >> :        <str name="rows">20</str>
> >> > >> :        <str name="fq">meta_doc_type:chapterDoc</str>
> >> > >> :        <str name="bq">{!synonym_edismax qf='title'
> synonyms='true'
> >> > >> : synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf=''
> >> bq=''
> >> > >> : v=$q}</str>
> >> > >> :        <str name="fl">id category_weight title category_ss score
> >> > >> : contentType</str>
> >> > >> :        <str name="titleQuery">{!edismax qf='title' bf='' bq=''
> >> > >> v=$q}</str>
> >> > >> : =====================================================
> >> > >> :        *<str name="bf">product(field(category_weight),20)</str>*
> >> > >> : =====================================================
> >> > >> :        <str name="bf">product(query($titleQuery),4)</str>
> >> > >> :        <str name="qf">text contentType^1000</str>
> >> > >> :        <str name="wt">python</str>
> >> > >> :        <str name="debug">true</str>
> >> > >> :        <str name="debug.explain.structured">true</str>
> >> > >> :        <str name="indent">true</str>
> >> > >> :        <str name="echoParams">all</str>
> >> > >> :      </lst>
> >> > >> :   </requestHandler>
> >> > >> :
> >> > >> : And here is the debug output for a query.  (This was a test for
> >> > >> synonyms,
> >> > >> : which you'll see in the output.) The original query string was,
> of
> >> > >> : course, "μ-heavy
> >> > >> : chain disease"
> >> > >> :
> >> > >> : You'll note that although there is no score in the first doc
> >> explain
> >> > for
> >> > >> : the actual querystring, the highlighted section does get a score
> >> for
> >> > >> : product(double(category_weight)=1.5,const(20))
> >> > >> :
> >> > >> : ... which is the thing that is currently causing all the docs in
> >> the
> >> > >> : collection to "match" even though the querystring is not in any
> of
> >> > them.
> >> > >> :
> >> > >> : "debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
> >> > >> : "querystring":"\"μ-heavy
> >> > >> : chain disease\"", "parsedquery":"(DisjunctionMaxQuery((text:\"μ
> >> heavy
> >> > >> chain
> >> > >> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
> >> > >> : ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" |
> >> > >> (contentType:\"mu
> >> > >> : heavy chain disease\")^1000.0)))/no_coord^1.1)
> >> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> >> > >> : hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ
> >> heavy
> >> > >> chain
> >> > >> : disease\" | (contentType:\"μ heavy chain
> >> > disease\")^1000.0)))/no_coord^
> >> > >> 1.1)
> >> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> >> > >> : hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ
> >> > heavy
> >> > >> : chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy
> >> chain
> >> > >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> >> > >> : hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy
> >> chain
> >> > >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> >> > >> : hcd\")))/no_coord^1.1)))
> >> > >> : FunctionQuery(product(double(category_weight),const(20)))
> >> > >> : FunctionQuery(product(query(+(title:\"μ heavy chain
> >> > >> : disease\"),def=0.0),const(4)))", "parsedquery_toString":"(((tex
> >> t:\"μ
> >> > >> heavy
> >> > >> : chain disease\" | (contentType:\"μ heavy chain
> >> disease\")^1000.0))^1.5
> >> > >> : ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy
> chain
> >> > >> : disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
> >> > >> : hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" |
> >> > >> (contentType:\"μ
> >> > >> : heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" |
> >> > >> (contentType:\"μ
> >> > >> : hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
> >> > >> : ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ
> >> hcd\"))^1.1)
> >> > >> : ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ
> >> > hcd\"))^1.1)))
> >> > >> : product(double(category_weight),const(20))
> >> product(query(+(title:\"μ
> >> > >> heavy
> >> > >> : chain disease\"),def=0.0),const(4))", "explain":{ "
> >> > >> : 33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true,
> >> "value":30.0, "
> >> > >> : description":"sum of:", "details":[{ "match":true, "value":30.0,
> "
> >> > >> : description":"FunctionQuery(product(double(category_weight),
> >> > >> const(20))),
> >> > >> : product of:",
> >> > >> : =====================================================
> >> > >> : *"details":**[{ "match":true, "value":30.0,
> >> > >> : "description":"product(double(category_weight)=1.5,const(20))"},
> >> {*
> >> > >> : =====================================================
> >> > >> :
> >> > >> : "match":true, "value":1.0, "description":"boost"}, {
> "match":true,
> >> > >> "value":
> >> > >> : 1.0, "description":"queryNorm"}]}, {
> >> > >> :
> >> > >>
> >> > >> -Hoss
> >> > >> http://www.lucidworks.com/
> >> > >
> >> > >
> >> > >
> >> >
> >>
> >
> >
>

Re: Want zero results from SOLR when there are no matches for "querystring"

Posted by John Bickerstaff <jo...@johnbickerstaff.com>.
@Hossman --  thanks again.

I've made the following change and so far things look good.  I couldn't see
any effect in the debug output or the results from what I put in for $func,
so I just removed it; otherwise, making the modifications you suggested
appears to be working.

Including the actual line from my endpoint XML in case this thread helps
someone else...

<str name="q">{!boost defType=synonym_edismax qf='title' synonyms='true'
synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
v=$q}</str>
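
For anyone who would rather send this per request than bake it into the
endpoint XML, here is a rough sketch of the parameter layout Hoss described.
It only builds the query string and does not talk to Solr.  The qq and func
parameter names come from Hoss's example, the boost function is the one from
my original bf line, and the helper name build_boost_params is made up for
illustration.

```python
from urllib.parse import urlencode


def build_boost_params(user_query,
                       inner_parser="synonym_edismax",
                       boost_func="product(field(category_weight),20)"):
    """Build request params that wrap the main query in {!boost}.

    The user's text goes in `qq` and the boost function in `func`
    (both names follow Hoss's example); `q` only references them.
    """
    return {
        "q": "{!boost b=$func defType=%s v=$qq}" % inner_parser,
        "qq": user_query,
        "func": boost_func,
    }


params = build_boost_params("bogus")
query_string = urlencode(params)  # percent-encodes braces, $, spaces, etc.
print("/foo?" + query_string)
```

Because the user's text travels in qq rather than q, the boost is multiplied
into whatever the inner parser matches, so a query string with no matches
should come back with zero docs.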

On Fri, Aug 12, 2016 at 12:09 PM, John Bickerstaff
<john@johnbickerstaff.com> wrote:

> Thanks!  I'll check it out.
>
> On Fri, Aug 12, 2016 at 12:05 PM, Susheel Kumar <su...@gmail.com>
> wrote:
>
>> Not exactly sure what you are looking for from chaining the results, but
>> similar functionality is available in Streaming Expressions, where the
>> results of inner expressions are passed to outer expressions and so on:
>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>>
>> HTH
>> Susheel
>>
>> On Fri, Aug 12, 2016 at 1:08 PM, John Bickerstaff <
>> john@johnbickerstaff.com>
>> wrote:
>>
>> > Hossman - many thanks again for your comprehensive and very helpful
>> answer!
>> >
>> > All,
>> >
>> > I am (possibly mis-)remembering reading something about being able to
>> > pass the results of one query to another query...  Essentially
>> > "chaining" result sets.
>> >
>> > I have looked in docs and can't find anything on a quick search -- I may
>> > have been reading about the Re-Ranking feature, which doesn't help me (I
>> > know because I just tried and it seems to return all results anyway, just
>> > re-ranking the number specified in the reRankDocs flag...)
>> >
>> > Is there a way to (cleanly) send the results of one query to another
>> > query for further processing?  Essentially, pass ONLY the results
>> > (including an empty set of results) to another query for processing?
>> >
>> > thanks...
>> >
>> > On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <
>> > john@johnbickerstaff.com>
>> > wrote:
>> >
>> > > Thanks!
>> > >
>> > > To answer your questions, while I digest the rest of that
>> information...
>> > >
>> > > I'm using the hon-lucene-synonyms.5.0.4.jar from here:
>> > > https://github.com/healthonnet/hon-lucene-synonyms
>> > >
>> > > The config looks like this - and IIRC, is simply a copy from the
>> > > recommended config on the site mentioned above.
>> > >
>> > >  <queryParser name="synonym_edismax"
>> > >      class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
>> > >     <!-- You can define more than one synonym analyzer in the
>> following
>> > > list.
>> > >          For example, you might have one set of synonyms for English,
>> one
>> > > for French,
>> > >          one for Spanish, etc.
>> > >       -->
>> > >     <lst name="synonymAnalyzers">
>> > >       <!-- Name your analyzer something useful, e.g. "analyzer_en",
>> > > "analyzer_fr", "analyzer_es", etc.
>> > >            If you only have one, the name doesn't matter (hence
>> > > "myCoolAnalyzer").
>> > >         -->
>> > >       <lst name="myCoolAnalyzer">
>> > >         <!-- We recommend a PatternTokenizerFactory that tokenizes
>> based
>> > > on whitespace and quotes.
>> > >              This seems to work best with most people's synonym files.
>> > >              For details, read the discussion here:
>> > > http://github.com/healthonnet/hon-lucene-synonyms/issues/26
>> > >           -->
>> > >         <lst name="tokenizer">
>> > >           <str name="class">solr.PatternTokenizerFactory</str>
>> > >           <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
>> > >         </lst>
>> > >         <!-- The ShingleFilterFactory outputs synonyms of multiple
>> token
>> > > lengths (e.g. unigrams, bigrams, trigrams, etc.).
>> > >              The default here is to assume you don't have any synonyms
>> > > longer than 4 tokens.
>> > >              You can tweak this depending on what your synonyms look
>> > like.
>> > > E.g. if you only have unigrams, you can remove
>> > >              it entirely, and if your synonyms are up to 7 tokens in
>> > > length, you should set the maxShingleSize to 7.
>> > >           -->
>> > >         <lst name="filter">
>> > >           <str name="class">solr.ShingleFilterFactory</str>
>> > >           <str name="outputUnigramsIfNoShingles">true</str>
>> > >           <str name="outputUnigrams">true</str>
>> > >           <str name="minShingleSize">2</str>
>> > >           <str name="maxShingleSize">4</str>
>> > >         </lst>
>> > >         <!-- This is where you set your synonym file.  For the unit
>> tests
>> > > and "Getting Started" examples, we use example_synonym_file.txt.
>> > >              This plugin will work best if you keep expand set to true
>> > and
>> > > have all your synonyms comma-separated (rather than =>-separated).
>> > >           -->
>> > >         <lst name="filter">
>> > >           <str name="class">solr.SynonymFilterFactory</str>
>> > >           <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
>> > >           <str name="synonyms">example_synonym_file.txt</str>
>> > >           <str name="expand">true</str>
>> > >           <str name="ignoreCase">true</str>
>> > >         </lst>
>> > >       </lst>
>> > >     </lst>
>> > >   </queryParser>
>> > >
>> > >
>> > >
>> > > On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <
>> > hossman_lucene@fucit.org
>> > > > wrote:
>> > >
>> > >>
>> > >> : First let me say that this is very possibly the "x - y problem" so
>> let
>> > >> me
>> > >> : state up front what my ultimate need is -- then I'll ask about the
>> > >> thing I
>> > >> : imagine might help...  which, of course, is heavily biased in the
>> > >> direction
>> > >> : of my experience coding Java and writing SQL...
>> > >>
>> > >> Thank you so much for asking your question this way!
>> > >>
>> > >> Right off the bat, the background you've provided seems suspicious...
>> > >>
>> > >> : I have a piece of a query that calculates a score based on a
>> > "weighting"
>> > >>         ...
>> > >> : The specific line is this:
>> > >> : <str name="bf">product(field(category_weight),20)</str>
>> > >> :
>> > >> : What I just realized is that when I query Solr for a string that
>> has
>> > NO
>> > >> : matches in the entire corpus, I still get a slew of results because
>> > >> EVERY
>> > >> : doc has the weighting value in the category_weight field - and
>> > therefore
>> > >> : every doc gets some score.
>> > >>
>> > >> ...that is *NOT* how dismax and edismax normally work.
>> > >>
>> > >> While both the "bf" and "bq" params result in "additive" boosting, and
>> > >> the implementation of that "additive boost" comes from adding new
>> > >> optional clauses to the top level BooleanQuery that is executed, that
>> > >> only happens after the "main" query (from your "q" param) is added to
>> > >> that top level BooleanQuery as a "mandatory" clause.
>> > >>
>> > >> So, for example, "bf=true()" and "bq=*:*" should match & boost every
>> > >> doc, but with the techproducts configs/data these requests still don't
>> > >> match anything...
>> > >>
>> > >> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
>> > >> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
>> > >>
>> > >> ...and if you look at the debug output, the parsed queries show that
>> > >> the "bogus" part of the query is mandatory...
>> > >>
>> > >> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
>> > >> FunctionQuery(const(true))
>> > >>
>> > >> (i didn't use "pf" in that example, but the effect is the same, the
>> > >> "pf" based clauses are optional, while the "qf" based clauses are
>> > >> mandatory)
>> > >>
>> > >> If you compare that example to your debug output, you'll notice a
>> > >> difference in structure -- it's a bit hard to see in your example, but
>> > >> if you simplify your qf, pf, and q fields it should be more obvious.
>> > >> AFAICT the "main" parts of your query are getting wrapped in an extra
>> > >> layer of parens (ie: an extra BooleanQuery) which is *not* mandatory in
>> > >> the top level query ... i don't see *any* mandatory clauses in your top
>> > >> level BooleanQuery, which is why any match on a bf or bq function is
>> > >> enough to cause a document to match.
>> > >>
>> > >> I suspect the reason your parsed query structure is so different has to
>> > >> do with this...
>> > >>
>> > >> :        <str name="defType">synonym_edismax</str>
>> > >>
>> > >>
>> > >> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
>> > >> 2) what QParserPlugin are you using to implement that?
>> > >>
>> > >> I suspect whatever QParserPlugin you are using has a bug in it :)
>> > >>
>> > >>
>> > >> If you can't fix the bug, one possible workaround would be to abandon
>> > >> the bf and bq params completely, and instead wrap the query it
>> > >> produces in a {!boost} parser with whatever function you want (using
>> > >> functions like sum() or product() to combine multiple functions, and
>> > >> query() to incorporate your current bq param).  Doing this will
>> > >> require changing how you specify your input (example below) and it
>> > >> will result in *multiplicative* boosts -- so your scores will be much
>> > >> different, and you will likely have to adjust your constants, but:
>> > >> 1) multiplicative boosts are almost always what people *really* want
>> > >> anyway; 2) it will ensure the boosts are only applied for things
>> > >> matching your main query, no matter how that query parser works or
>> > >> what bugs it has.
>> > >>
>> > >> Example of using {!boost} to wrap an arbitrary other parser...
>> > >>
>> > >> instead of...
>> > >>   defType=foofoo
>> > >>   q=barbarbar
>> > >>
>> > >> use...
>> > >>   q={!boost b=$func defType=foofoo v=$qq}
>> > >>   qq=barbarbar
>> > >>   func=sum(something,somethingelse)
>> > >>
>> > >> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
>> > >> https://cwiki.apache.org/confluence/display/solr/Function+Queries
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> :
>> > >> : What I would like is to return zero results if there is no match
>> for
>> > the
>> > >> : querystring.  My collection is small enough that I don't care if
>> the
>> > >> actual
>> > >> : calculation runs on each doc (although that's wasteful) -- I just
>> > don't
>> > >> : want to see results come back for zero matches to the querystring
>> > >> :
>> > >> : (The /select endpoint does this of course, but my custom endpoint
>> > >> includes
>> > >> : this "weighting" piece and therefore returns every doc in the
>> corpus
>> > >> : because they all have the weighting.
>> > >> :
>> > >> : ====================
>> > >> : Enter my imagined solution...  The potential X-Y problem...
>> > >> : ====================
>> > >> :
>> > >> : So - given that I come from a programming background, I immediately
>> > >> start
>> > >> : thinking of an if statement ...
>> > >> :
>> > >> :      if(some_score_for_the_primary_search_string) {
>> > >> :           run_the_category_weight_calculation;
>> > >> :      } else {
>> > >> :           do_NOT_run_category_weight_calc;
>> > >> :      }
>> > >> :
>> > >> :
>> > >> : Another way of thinking of it would be something like the "WHERE"
>> > >> clause in
>> > >> : SQL...
>> > >> :
>> > >> :  run_category_weight_calculation WHERE "searchstring" is found in
>> the
>> > >> : document, not otherwise.
>> > >> :
>> > >> : I'm aware that things could be handled in the client-side of my web
>> > app,
>> > >> : but if possible, I'd like the interface to SOLR to be as clean as
>> > >> possible,
>> > >> : and massage incoming SOLR data as little as possible.
>> > >> :
>> > >> : In other words, do NOT return any docs if the querystring (and any
>> > >> : synonyms) match zero docs.
>> > >> :
>> > >> : Here is the endpoint XML for the query.  I've highlighted the
>> specific
>> > >> line
>> > >> : that is causing the unintended results...
>> > >> :
>> > >> :
>> > >> :  <requestHandler name="/foo" class="solr.SearchHandler">
>> > >> :     <!-- default values for query parameters can be specified,
>> these
>> > >> :          will be overridden by parameters in the request
>> > >> :       -->
>> > >> :      <lst name="defaults">
>> > >> :        <str name="echoParams">all</str>
>> > >> :        <int name="rows">20</int>
>> > >> :        <!-- Query settings -->
>> > >> :        <str name="df">text</str>
>> > >> :       <!-- <str name="df">title</str> -->
>> > >> :        <str name="defType">synonym_edismax</str>
>> > >> :        <str name="synonyms">true</str>
>> > >> :     <!-- The line below balances out the weighting of exact
>> matches to
>> > >> the
>> > >> : synonym phrase entered by the user
>> > >> :          with the category_weight calculation and the titleQuery
>> calc.
>> > >> : These numbers exist in a balance and
>> > >> :          if one is raised or lowered, the others (probably) need to
>> > >> change
>> > >> : as well.  It may be better to go with decimals
>> > >> :          for all of them... .4 instead of 4 and 2 instead of 20 and
>> > 2.5
>> > >> : instead of 25.
>> > >> :          In the end, I'm not sure it really matters, but don't
>> change
>> > >> one
>> > >> : without changing the others
>> > >> :          unless you've tested and are sure you want the results
>> -->
>> > >> :        <float name="synonyms.originalBoost">1.5</float>
>> > >> :        <float name="synonyms.synonymBoost">1.1</float>
>> > >> :        <str name="mm">75%</str>
>> > >> :        <str name="q.alt">*:*</str>
>> > >> :        <str name="rows">20</str>
>> > >> :        <str name="fq">meta_doc_type:chapterDoc</str>
>> > >> :        <str name="bq">{!synonym_edismax qf='title' synonyms='true'
>> > >> : synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf=''
>> bq=''
>> > >> : v=$q}</str>
>> > >> :        <str name="fl">id category_weight title category_ss score
>> > >> : contentType</str>
>> > >> :        <str name="titleQuery">{!edismax qf='title' bf='' bq=''
>> > >> v=$q}</str>
>> > >> : =====================================================
>> > >> :        *<str name="bf">product(field(category_weight),20)</str>*
>> > >> : =====================================================
>> > >> :        <str name="bf">product(query($titleQuery),4)</str>
>> > >> :        <str name="qf">text contentType^1000</str>
>> > >> :        <str name="wt">python</str>
>> > >> :        <str name="debug">true</str>
>> > >> :        <str name="debug.explain.structured">true</str>
>> > >> :        <str name="indent">true</str>
>> > >> :        <str name="echoParams">all</str>
>> > >> :      </lst>
>> > >> :   </requestHandler>
>> > >> :
>> > >> : And here is the debug output for a query.  (This was a test for
>> > >> synonyms,
>> > >> : which you'll see in the output.) The original query string was, of
>> > >> : course, "μ-heavy
>> > >> : chain disease"
>> > >> :
>> > >> : You'll note that although there is no score in the first doc
>> explain
>> > for
>> > >> : the actual querystring, the highlighted section does get a score
>> for
>> > >> : product(double(category_weight)=1.5,const(20))
>> > >> :
>> > >> : ... which is the thing that is currently causing all the docs in
>> the
>> > >> : collection to "match" even though the querystring is not in any of
>> > them.
>> > >> :
>> > >> : "debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
>> > >> : "querystring":"\"μ-heavy
>> > >> : chain disease\"", "parsedquery":"(DisjunctionMaxQuery((text:\"μ
>> heavy
>> > >> chain
>> > >> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
>> > >> : ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" |
>> > >> (contentType:\"mu
>> > >> : heavy chain disease\")^1000.0)))/no_coord^1.1)
>> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
>> > >> : hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ
>> heavy
>> > >> chain
>> > >> : disease\" | (contentType:\"μ heavy chain
>> > disease\")^1000.0)))/no_coord^
>> > >> 1.1)
>> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
>> > >> : hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ
>> > heavy
>> > >> : chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy
>> chain
>> > >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
>> > >> : hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy
>> chain
>> > >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
>> > >> : hcd\")))/no_coord^1.1)))
>> > >> : FunctionQuery(product(double(category_weight),const(20)))
>> > >> : FunctionQuery(product(query(+(title:\"μ heavy chain
>> > >> : disease\"),def=0.0),const(4)))", "parsedquery_toString":"(((tex
>> t:\"μ
>> > >> heavy
>> > >> : chain disease\" | (contentType:\"μ heavy chain
>> disease\")^1000.0))^1.5
>> > >> : ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain
>> > >> : disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
>> > >> : hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" |
>> > >> (contentType:\"μ
>> > >> : heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" |
>> > >> (contentType:\"μ
>> > >> : hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
>> > >> : ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ
>> hcd\"))^1.1)
>> > >> : ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ
>> > hcd\"))^1.1)))
>> > >> : product(double(category_weight),const(20))
>> product(query(+(title:\"μ
>> > >> heavy
>> > >> : chain disease\"),def=0.0),const(4))", "explain":{ "
>> > >> : 33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true,
>> "value":30.0, "
>> > >> : description":"sum of:", "details":[{ "match":true, "value":30.0, "
>> > >> : description":"FunctionQuery(product(double(category_weight),
>> > >> const(20))),
>> > >> : product of:",
>> > >> : =====================================================
>> > >> : *"details":**[{ "match":true, "value":30.0,
>> > >> : "description":"product(double(category_weight)=1.5,const(20))"},
>> {*
>> > >> : =====================================================
>> > >> :
>> > >> : "match":true, "value":1.0, "description":"boost"}, { "match":true,
>> > >> "value":
>> > >> : 1.0, "description":"queryNorm"}]}, {
>> > >> :
>> > >>
>> > >> -Hoss
>> > >> http://www.lucidworks.com/

Re: Want zero results from SOLR when there are no matches for "querystring"

Posted by John Bickerstaff <jo...@johnbickerstaff.com>.
Thanks!  I'll check it out.

On Fri, Aug 12, 2016 at 12:05 PM, Susheel Kumar <su...@gmail.com>
wrote:

> I'm not exactly sure what you are looking for from chaining the results, but
> similar functionality is available in streaming expressions, where the results
> of inner expressions are passed to outer expressions, and so on:
> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>
> HTH
> Susheel

Re: Want zero results from SOLR when there are no matches for "querystring"

Posted by Susheel Kumar <su...@gmail.com>.
I'm not exactly sure what you are looking for from chaining the results, but
similar functionality is available in streaming expressions, where the results
of inner expressions are passed to outer expressions, and so on:
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions

HTH
Susheel
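
The nesting described above can be sketched roughly as follows. This is only
an illustration: the collection name "chapters", the "title" field, and the
host are placeholders rather than details from this thread, and the snippet
just builds the request URL for the /stream handler without contacting Solr.
It assumes the query text contains no double quotes.

```python
from urllib.parse import urlencode


def nested_stream_url(base_url, collection, query):
    """Build a nested streaming expression and the /stream URL that carries it."""
    # Inner expression: a plain search over the (hypothetical) collection.
    inner = (
        f'search({collection}, q="{query}", fl="id,title", '
        f'sort="title asc", qt="/export")'
    )
    # Outer expression: consumes whatever tuples the inner search emits.
    expr = f'unique({inner}, over="title")'
    url = f"{base_url}/{collection}/stream?" + urlencode({"expr": expr})
    return expr, url


expr, url = nested_stream_url("http://localhost:8983/solr", "chapters", "text:bogus")
print(expr)
print(url)
```

The outer expression only ever sees what the inner one emits, so, as far as I
understand it, an inner search with no matches should hand an empty stream to
the outer expression.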

On Fri, Aug 12, 2016 at 1:08 PM, John Bickerstaff <jo...@johnbickerstaff.com>
wrote:

> Hossman - many thanks again for your comprehensive and very helpful answer!
>
> All,
>
> I (possibly mis-)remember reading something about being able to pass
> the results of one query to another query...  Essentially "chaining" result
> sets.
>
> I have looked in the docs and can't find anything on a quick search -- I may
> have been reading about the Re-Ranking feature, which doesn't help me (I
> know because I just tried and it seems to return all results anyway, just
> re-ranking the number specified in the reRankDocs flag...)
>
> Is there a way to (cleanly) send the results of one query to another query
> for further processing?  Essentially, pass ONLY the results (including an
> empty set of results) to another query for processing?
>
> thanks...
>
> On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <john@johnbickerstaff.com> wrote:
>
> > Thanks!
> >
> > To answer your questions, while I digest the rest of that information...
> >
> > I'm using the hon-lucene-synonyms.5.0.4.jar from here:
> > https://github.com/healthonnet/hon-lucene-synonyms
> >
> > The config looks like this - and IIRC, is simply a copy from the
> > recommended config on the site mentioned above.
> >
> >  <queryParser name="synonym_edismax" class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
> >     <!-- You can define more than one synonym analyzer in the following list.
> >          For example, you might have one set of synonyms for English, one for French,
> >          one for Spanish, etc.
> >       -->
> >     <lst name="synonymAnalyzers">
> >       <!-- Name your analyzer something useful, e.g. "analyzer_en", "analyzer_fr", "analyzer_es", etc.
> >            If you only have one, the name doesn't matter (hence "myCoolAnalyzer").
> >         -->
> >       <lst name="myCoolAnalyzer">
> >         <!-- We recommend a PatternTokenizerFactory that tokenizes based on whitespace and quotes.
> >              This seems to work best with most people's synonym files.
> >              For details, read the discussion here:
> >              http://github.com/healthonnet/hon-lucene-synonyms/issues/26
> >           -->
> >         <lst name="tokenizer">
> >           <str name="class">solr.PatternTokenizerFactory</str>
> >           <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
> >         </lst>
> >         <!-- The ShingleFilterFactory outputs synonyms of multiple token lengths (e.g. unigrams, bigrams, trigrams, etc.).
> >              The default here is to assume you don't have any synonyms longer than 4 tokens.
> >              You can tweak this depending on what your synonyms look like. E.g. if you only have unigrams, you can remove
> >              it entirely, and if your synonyms are up to 7 tokens in length, you should set the maxShingleSize to 7.
> >           -->
> >         <lst name="filter">
> >           <str name="class">solr.ShingleFilterFactory</str>
> >           <str name="outputUnigramsIfNoShingles">true</str>
> >           <str name="outputUnigrams">true</str>
> >           <str name="minShingleSize">2</str>
> >           <str name="maxShingleSize">4</str>
> >         </lst>
> >         <!-- This is where you set your synonym file.  For the unit tests and "Getting Started" examples, we use example_synonym_file.txt.
> >              This plugin will work best if you keep expand set to true and have all your synonyms comma-separated (rather than =>-separated).
> >           -->
> >         <lst name="filter">
> >           <str name="class">solr.SynonymFilterFactory</str>
> >           <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
> >           <str name="synonyms">example_synonym_file.txt</str>
> >           <str name="expand">true</str>
> >           <str name="ignoreCase">true</str>
> >         </lst>
> >       </lst>
> >     </lst>
> >   </queryParser>
> >
> >
> >
> > On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <hossman_lucene@fucit.org> wrote:
> >
> >>
> >> : First let me say that this is very possibly the "x - y problem" so let me
> >> : state up front what my ultimate need is -- then I'll ask about the thing I
> >> : imagine might help...  which, of course, is heavily biased in the direction
> >> : of my experience coding Java and writing SQL...
> >>
> >> Thank you so much for asking your question this way!
> >>
> >> Right off the bat, the background you've provided seems suspicious...
> >>
> >> : I have a piece of a query that calculates a score based on a "weighting"
> >>         ...
> >> : The specific line is this:
> >> : <str name="bf">product(field(category_weight),20)</str>
> >> :
> >> : What I just realized is that when I query Solr for a string that has NO
> >> : matches in the entire corpus, I still get a slew of results because EVERY
> >> : doc has the weighting value in the category_weight field - and therefore
> >> : every doc gets some score.
> >>
> >> ...that is *NOT* how dismax and edismax normally work.
> >>
> >> While both the "bf" and "bq" params result in "additive" boosting, and the
> >> implementation of that "additive boost" comes from adding new optional
> >> clauses to the top level BooleanQuery that is executed, that only happens
> >> after the "main" query (from your "q" param) is added to that top level
> >> BooleanQuery as a "mandatory" clause.
> >>
> >> So, for example, "bf=true()" and "bq=*:*" should match & boost every doc,
> >> but with the techproducts configs/data these requests still don't match
> >> anything...
> >>
> >> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
> >> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
> >>
> >> ...and if you look at the debug output, the parsed query shows that the
> >> "bogus" part of the query is mandatory...
> >>
> >> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
> >> FunctionQuery(const(true))
> >>
> >> (i didn't use "pf" in that example, but the effect is the same, the "pf"
> >> based clauses are optional, while the "qf" based clauses are mandatory)
> >>
> >> If you compare that example to your debug output, you'll notice a
> >> difference in structure.  It's a bit hard to see in your example (it should
> >> be more obvious if you simplify your qf, pf, and q fields), but AFAICT the
> >> "main" parts of your query are getting wrapped in an extra layer of
> >> parentheses (ie: an extra BooleanQuery) which is *not* mandatory in the top
> >> level query ... i don't see *any* mandatory clauses in your top level
> >> BooleanQuery, which is why any match on a bf or bq function is enough to
> >> cause a document to match.
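
One rough way to apply that check to a "parsedquery" debug string is
sketched below. It is a heuristic, not anything Solr provides: it splits the
string on spaces that sit outside all parentheses and looks for a clause
starting with "+".  The second test string is a simplified stand-in for the
structure in this thread's debug output, not the real output, and quoted
phrases that themselves contain parentheses would confuse it.

```python
def top_level_clauses(parsed_query):
    """Split a parsedquery debug string on spaces that are outside all parentheses."""
    clauses, current, depth = [], [], 0
    for ch in parsed_query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == " " and depth == 0:
            if current:
                clauses.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        clauses.append("".join(current))
    return clauses


def has_mandatory_clause(parsed_query):
    """True if any top-level clause carries the '+' (mandatory) marker."""
    return any(clause.startswith("+") for clause in top_level_clauses(parsed_query))


# Structure reported above for stock edismax: the main query is mandatory.
stock = ("+DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*) "
         "FunctionQuery(const(true))")
# Simplified stand-in for the structure in the debug output under discussion:
# the main query is wrapped in an extra, optional group.
wrapped = ("(DisjunctionMaxQuery((text:bogus))^1.5 "
           "((+DisjunctionMaxQuery((text:fake)))^1.1)) "
           "FunctionQuery(product(double(category_weight),const(20)))")

print(has_mandatory_clause(stock))
print(has_mandatory_clause(wrapped))
```

With no mandatory clause at the top level, any optional clause that matches
(such as the bf function) is enough for a document to be returned.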
> >>
> >> I suspect the reason your parsed query structure is so different has to do
> >> with this...
> >>
> >> :        <str name="defType">synonym_edismax</str>>
> >>
> >>
> >> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
> >> 2) what QParserPlugin are you using to implement that?
> >>
> >> I suspect whatever QParserPlugin you are using has a bug in it :)
> >>
> >>
> >> If you can't fix the bug, one possible workaround would be to abandon the bf
> >> and bq params completely, and instead wrap the query it produces in a
> >> {!boost} parser with whatever function you want (using functions like
> >> sum() or product() to combine multiple functions, and query() to incorporate
> >> your current bq param).  Doing this will require changing how you specify
> >> your input (example below) and it will result in *multiplicative* boosts --
> >> so your scores will be much different, and you will likely have to adjust
> >> your constants, but: 1) multiplicative boosts are almost always what people
> >> *really* want anyway; 2) it will ensure the boosts are only applied for
> >> things matching your main query, no matter how that query parser works or
> >> what bugs it has.
> >>
> >> Example of using {!boost} to wrap an arbitrary other parser...
> >>
> >> instead of...
> >>   defType=foofoo
> >>   q=barbarbar
> >>
> >> use...
> >>   q={!boost b=$func defType=foofoo v=$qq}
> >>   qq=barbarbar
> >>   func=sum(something,somethingelse)
> >>
> >> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
> >> https://cwiki.apache.org/confluence/display/solr/Function+Queries
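
Applied to the handler in this thread, the {!boost} pattern above might look
something like the sketch below.  It is untested against a real Solr instance
and makes several assumptions: the bf and bq defaults have been removed from
the request handler, the two former bf functions are simply combined with
sum(), and titleQuery is pointed at $qq because q now holds the {!boost}
wrapper rather than the user's text.  The helper name boosted_params is made
up for illustration.

```python
from urllib.parse import urlencode


def boosted_params(user_query):
    """Request params that wrap the main query in {!boost}, following the pattern above."""
    return {
        # The wrapper: the boost function multiplies the score of whatever
        # the inner synonym_edismax query matches.
        "q": "{!boost b=$func defType=synonym_edismax v=$qq}",
        # The user's actual search text now travels in qq.
        "qq": user_query,
        # Assumption: the two former bf functions combined additively.
        "func": "sum(product(field(category_weight),20),"
                "product(query($titleQuery),4))",
        # Must reference $qq, since $q is now the wrapper itself.
        "titleQuery": "{!edismax qf='title' bf='' bq='' v=$qq}",
    }


params = boosted_params('"μ-heavy chain disease"')
query_string = urlencode(params)
print(query_string)
```

Because the boost is multiplicative, the constants 20 and 4 would almost
certainly need retuning, as noted above.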
> >>
> >>
> >>
> >>
> >> :
> >> : What I would like is to return zero results if there is no match for the
> >> : querystring.  My collection is small enough that I don't care if the actual
> >> : calculation runs on each doc (although that's wasteful) -- I just don't
> >> : want to see results come back for zero matches to the querystring
> >> :
> >> : (The /select endpoint does this of course, but my custom endpoint includes
> >> : this "weighting" piece and therefore returns every doc in the corpus
> >> : because they all have the weighting.)
> >> :
> >> : ====================
> >> : Enter my imagined solution...  The potential X-Y problem...
> >> : ====================
> >> :
> >> : So - given that I come from a programming background, I immediately start
> >> : thinking of an if statement ...
> >> :
> >> :      if(some_score_for_the_primary_search_string) {
> >> :           run_the_category_weight_calculation;
> >> :      } else {
> >> :           do_NOT_run_category_weight_calc;
> >> :      }
> >> :
> >> :
> >> : Another way of thinking of it would be something like the "WHERE" clause in
> >> : SQL...
> >> :
> >> :  run_category_weight_calculation WHERE "searchstring" is found in the
> >> : document, not otherwise.
> >> :
> >> : I'm aware that things could be handled in the client-side of my web app,
> >> : but if possible, I'd like the interface to SOLR to be as clean as possible,
> >> : and massage incoming SOLR data as little as possible.
> >> :
> >> : In other words, do NOT return any docs if the querystring (and any
> >> : synonyms) match zero docs.
> >> :
> >> : Here is the endpoint XML for the query.  I've highlighted the specific line
> >> : that is causing the unintended results...
> >> :
> >> :
> >> :  <requestHandler name="/foo" class="solr.SearchHandler">
> >> :     <!-- default values for query parameters can be specified, these
> >> :          will be overridden by parameters in the request
> >> :       -->
> >> :      <lst name="defaults">
> >> :        <str name="echoParams">all</str>
> >> :        <int name="rows">20</int>
> >> :        <!-- Query settings -->
> >> :        <str name="df">text</str>
> >> :       <!-- <str name="df">title</str> -->
> >> :        <str name="defType">synonym_edismax</str>>
> >> :        <str name="synonyms">true</str>
> >> :     <!-- The line below balances out the weighting of exact matches to the synonym phrase entered by the user
> >> :          with the category_weight calculation and the titleQuery calc.  These numbers exist in a balance and
> >> :          if one is raised or lowered, the others (probably) need to change as well.  It may be better to go with decimals
> >> :          for all of them... .4 instead of 4 and 2 instead of 20 and 2.5 instead of 25.
> >> :          In the end, I'm not sure it really matters, but don't change one without changing the others
> >> :          unless you've tested and are sure you want the results  -->
> >> :        <float name="synonyms.originalBoost">1.5</float>
> >> :        <float name="synonyms.synonymBoost">1.1</float>
> >> :        <str name="mm">75%</str>
> >> :        <str name="q.alt">*:*</str>
> >> :        <str name="rows">20</str>
> >> :        <str name="fq">meta_doc_type:chapterDoc</str>
> >> :        <str name="bq">{!synonym_edismax qf='title' synonyms='true'
> >> : synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
> >> : v=$q}</str>
> >> :        <str name="fl">id category_weight title category_ss score
> >> : contentType</str>
> >> :        <str name="titleQuery">{!edismax qf='title' bf='' bq=''
> >> v=$q}</str>
> >> : =====================================================
> >> :        *<str name="bf">product(field(category_weight),20)</str>*
> >> : =====================================================
> >> :        <str name="bf">product(query($titleQuery),4)</str>
> >> :        <str name="qf">text contentType^1000</str>
> >> :        <str name="wt">python</str>
> >> :        <str name="debug">true</str>
> >> :        <str name="debug.explain.structured">true</str>
> >> :        <str name="indent">true</str>
> >> :        <str name="echoParams">all</str>
> >> :      </lst>
> >> :   </requestHandler>
> >> :
> >> : And here is the debug output for a query.  (This was a test for
> >> synonyms,
> >> : which you'll see in the output.) The original query string was, of
> >> : course, "μ-heavy
> >> : chain disease"
> >> :
> >> : You'll note that although there is no score in the first doc explain
> for
> >> : the actual querystring, the highlighted section does get a score for
> >> : product(double(category_weight)=1.5,const(20))
> >> :
> >> : ... which is the thing that is currently causing all the docs in the
> >> : collection to "match" even though the querystring is not in any of
> them.
> >> :
> >> : "debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
> >> : "querystring":"\"μ-heavy
> >> : chain disease\"", "parsedquery":"(DisjunctionMaxQuery((text:\"μ heavy
> >> chain
> >> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
> >> : ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" |
> >> (contentType:\"mu
> >> : heavy chain disease\")^1000.0)))/no_coord^1.1)
> >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> >> : hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ heavy
> >> chain
> >> : disease\" | (contentType:\"μ heavy chain
> disease\")^1000.0)))/no_coord^
> >> 1.1)
> >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> >> : hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ
> heavy
> >> : chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy chain
> >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> >> : hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy chain
> >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> >> : hcd\")))/no_coord^1.1)))
> >> : FunctionQuery(product(double(category_weight),const(20)))
> >> : FunctionQuery(product(query(+(title:\"μ heavy chain
> >> : disease\"),def=0.0),const(4)))", "parsedquery_toString":"(((text:\"μ
> >> heavy
> >> : chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
> >> : ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain
> >> : disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
> >> : hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" |
> >> (contentType:\"μ
> >> : heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" |
> >> (contentType:\"μ
> >> : hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
> >> : ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)
> >> : ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ
> hcd\"))^1.1)))
> >> : product(double(category_weight),const(20)) product(query(+(title:\"μ
> >> heavy
> >> : chain disease\"),def=0.0),const(4))", "explain":{ "
> >> : 33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true, "value":30.0, "
> >> : description":"sum of:", "details":[{ "match":true, "value":30.0, "
> >> : description":"FunctionQuery(product(double(category_weight),
> >> const(20))),
> >> : product of:",
> >> : =====================================================
> >> : *"details":**[{ "match":true, "value":30.0,
> >> : "description":"product(double(category_weight)=1.5,const(20))"}, {*
> >> : =====================================================
> >> :
> >> : "match":true, "value":1.0, "description":"boost"}, { "match":true,
> >> "value":
> >> : 1.0, "description":"queryNorm"}]}, {
> >> :
> >>
> >> -Hoss
> >> http://www.lucidworks.com/
> >
> >
> >
>

Re: Want zero results from SOLR when there are no matches for "querystring"

Posted by John Bickerstaff <jo...@johnbickerstaff.com>.
Hossman - many thanks again for your comprehensive and very helpful answer!

All,

I (possibly mis-)remember reading something about being able to pass
the results of one query to another query...  Essentially "chaining" result
sets.

I have looked in docs and can't find anything on a quick search -- I may
have been reading about the Re-Ranking feature, which doesn't help me (I
know because I just tried and it seems to return all results anyway, just
re-ranking the number specified in the reRankDocs flag...)

Is there a way to (cleanly) send the results of one query to another query
for further processing?  Essentially, pass ONLY the results (including an
empty set of results) to another query for processing?

thanks...

On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <jo...@johnbickerstaff.com>
wrote:

> Thanks!
>
> To answer your questions, while I digest the rest of that information...
>
> I'm using the hon-lucene-synonyms.5.0.4.jar from here:
> https://github.com/healthonnet/hon-lucene-synonyms
>
> The config looks like this - and IIRC, is simply a copy from the
> recommended config on the site mentioned above.
>
>  <queryParser name="synonym_edismax" class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
>     <!-- You can define more than one synonym analyzer in the following
> list.
>          For example, you might have one set of synonyms for English, one
> for French,
>          one for Spanish, etc.
>       -->
>     <lst name="synonymAnalyzers">
>       <!-- Name your analyzer something useful, e.g. "analyzer_en",
> "analyzer_fr", "analyzer_es", etc.
>            If you only have one, the name doesn't matter (hence
> "myCoolAnalyzer").
>         -->
>       <lst name="myCoolAnalyzer">
>         <!-- We recommend a PatternTokenizerFactory that tokenizes based
> on whitespace and quotes.
>              This seems to work best with most people's synonym files.
>              For details, read the discussion here:
> http://github.com/healthonnet/hon-lucene-synonyms/issues/26
>           -->
>         <lst name="tokenizer">
>           <str name="class">solr.PatternTokenizerFactory</str>
>           <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
>         </lst>
>         <!-- The ShingleFilterFactory outputs synonyms of multiple token
> lengths (e.g. unigrams, bigrams, trigrams, etc.).
>              The default here is to assume you don't have any synonyms
> longer than 4 tokens.
>              You can tweak this depending on what your synonyms look like.
> E.g. if you only have unigrams, you can remove
>              it entirely, and if your synonyms are up to 7 tokens in
> length, you should set the maxShingleSize to 7.
>           -->
>         <lst name="filter">
>           <str name="class">solr.ShingleFilterFactory</str>
>           <str name="outputUnigramsIfNoShingles">true</str>
>           <str name="outputUnigrams">true</str>
>           <str name="minShingleSize">2</str>
>           <str name="maxShingleSize">4</str>
>         </lst>
>         <!-- This is where you set your synonym file.  For the unit tests
> and "Getting Started" examples, we use example_synonym_file.txt.
>              This plugin will work best if you keep expand set to true and
> have all your synonyms comma-separated (rather than =>-separated).
>           -->
>         <lst name="filter">
>           <str name="class">solr.SynonymFilterFactory</str>
>           <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
>           <str name="synonyms">example_synonym_file.txt</str>
>           <str name="expand">true</str>
>           <str name="ignoreCase">true</str>
>         </lst>
>       </lst>
>     </lst>
>   </queryParser>
>
>
>
> On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <hossman_lucene@fucit.org
> > wrote:
>
>>
>> : First let me say that this is very possibly the "x - y problem" so let
>> me
>> : state up front what my ultimate need is -- then I'll ask about the
>> thing I
>> : imagine might help...  which, of course, is heavily biased in the
>> direction
>> : of my experience coding Java and writing SQL...
>>
>> Thank you so much for asking your question this way!
>>
>> Right off the bat, the background you've provided seems suspicious...
>>
>> : I have a piece of a query that calculates a score based on a "weighting"
>>         ...
>> : The specific line is this:
>> : <str name="bf">product(field(category_weight),20)</str>
>> :
>> : What I just realized is that when I query Solr for a string that has NO
>> : matches in the entire corpus, I still get a slew of results because
>> EVERY
>> : doc has the weighting value in the category_weight field - and therefore
>> : every doc gets some score.
>>
>> ...that is *NOT* how dismax and edismax normally work.
>>
>> While both the "bf" and "bq" params result in "additive" boosting, and the
>> implementation of that "additive boost" comes from adding new optional
>> clauses to the top level BooleanQuery that is executed, that only happens
>> after the "main" query (from your "q" param) is added to that top level
>> BooleanQuery as a "mandatory" clause.
>>
>> So, for example, "bf=true()" and "bq=*:*" should match & boost every doc,
>> but with the techproducts configs/data these requests still don't match
>> anything...
>>
>> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
>> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
>>
>> ...and if you look at the debug output, the parsed queries show that the
>> "bogus" part of the query is mandatory...
>>
>> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
>> FunctionQuery(const(true))
>>
>> (i didn't use "pf" in that example, but the effect is the same, the "pf"
>> based clauses are optional, while the "qf" based clauses are mandatory)
>>
>> If you compare that example to your debug output, you'll notice a
>> difference in structure -- it's a bit hard to see in your example, but if
>> you simplify your qf, pf, and q fields it should be more obvious, but
>> AFAICT the "main" parts of your query are getting wrapped in an extra
>> layer of parens (ie: an extra BooleanQuery) which is *not* mandatory in
>> the top level query ... i don't see *any* mandatory clauses in your top
>> level BooleanQuery, which is why any match on a bf or bq function is
>> enough to cause a document to match.
>>
>> I suspect the reason your parsed query structure is so different has to do with
>> this...
>>
>> :        <str name="defType">synonym_edismax</str>>
>>
>>
>> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
>> 2) what QParserPlugin are you using to implement that?
>>
>> I suspect whatever QParserPlugin you are using has a bug in it :)
>>
>>
>> If you can't fix the bug, one possible workaround would be to abandon the bf
>> and bq params completely, and instead wrap the query it produces in a
>> {!boost} parser with whatever function you want (using functions like
>> sum() or product() to combine multiple functions, and query() to incorporate
>> your current bq param).  Doing this will require changing how you specify
>> your input (example below) and it will result in *multiplicative* boosts --
>> so your scores will be much different, and you will likely have to adjust your
>> constants, but: 1) multiplicative boosts are almost always what people
>> *really* want anyway; 2) it will ensure the boosts are only applied for
>> things matching your main query, no matter how that query parser works or
>> what bugs it has.
>>
>> Example of using {!boost} to wrap an arbitrary other parser...
>>
>> instead of...
>>   defType=foofoo
>>   q=barbarbar
>>
>> use...
>>   q={!boost b=$func defType=foofoo v=$qq}
>>   qq=barbarbar
>>   func=sum(something,somethingelse)
>>
>> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
>> https://cwiki.apache.org/confluence/display/solr/Function+Queries
>>
>>
>>
>>
>> :
>> : What I would like is to return zero results if there is no match for the
>> : querystring.  My collection is small enough that I don't care if the
>> actual
>> : calculation runs on each doc (although that's wasteful) -- I just don't
>> : want to see results come back for zero matches to the querystring
>> :
>> : (The /select endpoint does this of course, but my custom endpoint
>> includes
>> : this "weighting" piece and therefore returns every doc in the corpus
>> : because they all have the weighting.
>> :
>> : ====================
>> : Enter my imagined solution...  The potential X-Y problem...
>> : ====================
>> :
>> : So - given that I come from a programming background, I immediately
>> start
>> : thinking of an if statement ...
>> :
>> :      if(some_score_for_the_primary_search_string) {
>> :           run_the_category_weight_calculation;
>> :      } else {
>> :           do_NOT_run_category_weight_calc;
>> :      }
>> :
>> :
>> : Another way of thinking of it would be something like the "WHERE"
>> clause in
>> : SQL...
>> :
>> :  run_category_weight_calculation WHERE "searchstring" is found in the
>> : document, not otherwise.
>> :
>> : I'm aware that things could be handled in the client-side of my web app,
>> : but if possible, I'd like the interface to SOLR to be as clean as
>> possible,
>> : and massage incoming SOLR data as little as possible.
>> :
>> : In other words, do NOT return any docs if the querystring (and any
>> : synonyms) match zero docs.
>> :
>> : Here is the endpoint XML for the query.  I've highlighted the specific
>> line
>> : that is causing the unintended results...
>> :
>> :
>> :  <requestHandler name="/foo" class="solr.SearchHandler">
>> :     <!-- default values for query parameters can be specified, these
>> :          will be overridden by parameters in the request
>> :       -->
>> :      <lst name="defaults">
>> :        <str name="echoParams">all</str>
>> :        <int name="rows">20</int>
>> :        <!-- Query settings -->
>> :        <str name="df">text</str>
>> :       <!-- <str name="df">title</str> -->
>> :        <str name="defType">synonym_edismax</str>>
>> :        <str name="synonyms">true</str>
>> :     <!-- The line below balances out the weighting of exact matches to
>> the
>> : synonym phrase entered by the user
>> :          with the category_weight calculation and the titleQuery calc.
>> : These numbers exist in a balance and
>> :          if one is raised or lowered, the others (probably) need to
>> change
>> : as well.  It may be better to go with decimals
>> :          for all of them... .4 instead of 4 and 2 instead of 20 and 2.5
>> : instead of 25.
>> :          In the end, I'm not sure it really matters, but don't change
>> one
>> : without changing the others
>> :          unless you've tested and are sure you want the results  -->
>> :        <float name="synonyms.originalBoost">1.5</float>
>> :        <float name="synonyms.synonymBoost">1.1</float>
>> :        <str name="mm">75%</str>
>> :        <str name="q.alt">*:*</str>
>> :        <str name="rows">20</str>
>> :        <str name="fq">meta_doc_type:chapterDoc</str>
>> :        <str name="bq">{!synonym_edismax qf='title' synonyms='true'
>> : synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
>> : v=$q}</str>
>> :        <str name="fl">id category_weight title category_ss score
>> : contentType</str>
>> :        <str name="titleQuery">{!edismax qf='title' bf='' bq=''
>> v=$q}</str>
>> : =====================================================
>> :        *<str name="bf">product(field(category_weight),20)</str>*
>> : =====================================================
>> :        <str name="bf">product(query($titleQuery),4)</str>
>> :        <str name="qf">text contentType^1000</str>
>> :        <str name="wt">python</str>
>> :        <str name="debug">true</str>
>> :        <str name="debug.explain.structured">true</str>
>> :        <str name="indent">true</str>
>> :        <str name="echoParams">all</str>
>> :      </lst>
>> :   </requestHandler>
>> :
>> : And here is the debug output for a query.  (This was a test for
>> synonyms,
>> : which you'll see in the output.) The original query string was, of
>> : course, "μ-heavy
>> : chain disease"
>> :
>> : You'll note that although there is no score in the first doc explain for
>> : the actual querystring, the highlighted section does get a score for
>> : product(double(category_weight)=1.5,const(20))
>> :
>> : ... which is the thing that is currently causing all the docs in the
>> : collection to "match" even though the querystring is not in any of them.
>> :
>> : "debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
>> : "querystring":"\"μ-heavy
>> : chain disease\"", "parsedquery":"(DisjunctionMaxQuery((text:\"μ heavy
>> chain
>> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
>> : ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" |
>> (contentType:\"mu
>> : heavy chain disease\")^1000.0)))/no_coord^1.1)
>> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
>> : hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ heavy
>> chain
>> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0)))/no_coord^
>> 1.1)
>> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
>> : hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ heavy
>> : chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy chain
>> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
>> : hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy chain
>> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
>> : hcd\")))/no_coord^1.1)))
>> : FunctionQuery(product(double(category_weight),const(20)))
>> : FunctionQuery(product(query(+(title:\"μ heavy chain
>> : disease\"),def=0.0),const(4)))", "parsedquery_toString":"(((text:\"μ
>> heavy
>> : chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
>> : ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain
>> : disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
>> : hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" |
>> (contentType:\"μ
>> : heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" |
>> (contentType:\"μ
>> : hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
>> : ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)
>> : ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)))
>> : product(double(category_weight),const(20)) product(query(+(title:\"μ
>> heavy
>> : chain disease\"),def=0.0),const(4))", "explain":{ "
>> : 33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true, "value":30.0, "
>> : description":"sum of:", "details":[{ "match":true, "value":30.0, "
>> : description":"FunctionQuery(product(double(category_weight),
>> const(20))),
>> : product of:",
>> : =====================================================
>> : *"details":**[{ "match":true, "value":30.0,
>> : "description":"product(double(category_weight)=1.5,const(20))"}, {*
>> : =====================================================
>> :
>> : "match":true, "value":1.0, "description":"boost"}, { "match":true,
>> "value":
>> : 1.0, "description":"queryNorm"}]}, {
>> :
>>
>> -Hoss
>> http://www.lucidworks.com/
>
>
>

Re: Want zero results from SOLR when there are no matches for "querystring"

Posted by Chris Hostetter <ho...@fucit.org>.
: First let me say that this is very possibly the "x - y problem" so let me
: state up front what my ultimate need is -- then I'll ask about the thing I
: imagine might help...  which, of course, is heavily biased in the direction
: of my experience coding Java and writing SQL...

Thank you so much for asking your question this way!

Right off the bat, the background you've provided seems suspicious...

: I have a piece of a query that calculates a score based on a "weighting"
	...
: The specific line is this:
: <str name="bf">product(field(category_weight),20)</str>
: 
: What I just realized is that when I query Solr for a string that has NO
: matches in the entire corpus, I still get a slew of results because EVERY
: doc has the weighting value in the category_weight field - and therefore
: every doc gets some score.

...that is *NOT* how dismax and edismax normally work.

While both the "bf" and "bq" params result in "additive" boosting, and the
implementation of that "additive boost" comes from adding new optional
clauses to the top level BooleanQuery that is executed, that only happens
after the "main" query (from your "q" param) is added to that top level
BooleanQuery as a "mandatory" clause.

So, for example, "bf=true()" and "bq=*:*" should match & boost every doc,
but with the techproducts configs/data these requests still don't match
anything...

/select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
/select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query

...and if you look at the debug output, the parsed queries show that the
"bogus" part of the query is mandatory...

+DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*) FunctionQuery(const(true))

(i didn't use "pf" in that example, but the effect is the same, the "pf" 
based clauses are optional, while the "qf" based clauses are mandatory)

If you compare that example to your debug output, you'll notice a 
difference in structure -- it's a bit hard to see in your example, but if 
you simplify your qf, pf, and q fields it should be more obvious, but 
AFAICT the "main" parts of your query are getting wrapped in an extra 
layer of parens (ie: an extra BooleanQuery) which is *not* mandatory in
the top level query ... i don't see *any* mandatory clauses in your top 
level BooleanQuery, which is why any match on a bf or bq function is 
enough to cause a document to match.
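
The structural check described above can be sketched in a few lines of
Python.  This is only an illustration, not part of Solr: the helper names
are made up, and the second test string is a hand-simplified stand-in for
the debug output in this thread.  It splits a "parsedquery" string into
its top-level clauses and reports whether any of them carries the "+"
(mandatory) prefix.

```python
def top_level_clauses(parsed_query: str) -> list[str]:
    """Split a Solr debug 'parsedquery' string into its top-level clauses.

    Whitespace only separates clauses when it is outside all parentheses
    and outside any quoted phrase.
    """
    clauses, current = [], []
    depth, in_quote = 0, False
    for ch in parsed_query:
        if ch == '"':
            in_quote = not in_quote
        elif not in_quote and ch == "(":
            depth += 1
        elif not in_quote and ch == ")":
            depth -= 1
        if ch.isspace() and depth == 0 and not in_quote:
            if current:
                clauses.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        clauses.append("".join(current))
    return clauses


def has_mandatory_clause(parsed_query: str) -> bool:
    """True if at least one top-level clause is marked mandatory with '+'."""
    return any(c.startswith("+") for c in top_level_clauses(parsed_query))


# Shape produced by stock edismax (taken from the example above).
stock = ("+DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*) "
         "FunctionQuery(const(true))")

# Hand-simplified (hypothetical) version of the synonym_edismax shape:
# the main query sits in an extra, optional pair of parens.
wrapped = ('(DisjunctionMaxQuery((text:"bogus"))^1.5 '
           '((+DisjunctionMaxQuery((text:"fake")))/no_coord^1.1)) '
           "FunctionQuery(product(double(category_weight),const(20)))")

print(has_mandatory_clause(stock))
print(has_mandatory_clause(wrapped))
```

A "+" nested inside the optional wrapper does not count, which is why a
bf or bq clause on its own is enough for a document to match in the
second shape.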

I suspect the reason your parsed query structure is so different has to do with
this...

:        <str name="defType">synonym_edismax</str>>


1) how exactly is "synonym_edismax" defined in your solrconfig.xml? 
2) what QParserPlugin are you using to implement that?

I suspect whatever QParserPlugin you are using has a bug in it :)


If you can't fix the bug, one possible workaround would be to abandon the bf
and bq params completely, and instead wrap the query it produces in a
{!boost} parser with whatever function you want (using functions like
sum() or product() to combine multiple functions, and query() to incorporate
your current bq param).  Doing this will require changing how you specify
your input (example below) and it will result in *multiplicative* boosts --
so your scores will be much different, and you will likely have to adjust your
constants, but: 1) multiplicative boosts are almost always what people
*really* want anyway; 2) it will ensure the boosts are only applied for
things matching your main query, no matter how that query parser works or
what bugs it has.

Example of using {!boost} to wrap an arbitrary other parser...

instead of...
  defType=foofoo
  q=barbarbar

use...
  q={!boost b=$func defType=foofoo v=$qq}
  qq=barbarbar
  func=sum(something,somethingelse)

https://cwiki.apache.org/confluence/display/solr/Other+Parsers
https://cwiki.apache.org/confluence/display/solr/Function+Queries
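
As a rough sketch of how the {!boost} wrapper above might be applied to
the /foo handler from this thread, the Python below only assembles the
request parameters; it does not talk to Solr.  Several things here are
assumptions rather than tested configuration: that the bf and bq
defaults have been removed from the handler, that the two existing bf
functions are combined with sum(), and that titleQuery is re-pointed at
$qq (since q now holds the wrapper).  The function name is hypothetical.

```python
from urllib.parse import urlencode


def boost_wrapped_params(user_query: str) -> dict[str, str]:
    """Build request params that wrap synonym_edismax in {!boost}.

    Assumes bf/bq are no longer set in the handler defaults, so the only
    boosting comes from the multiplicative 'func' parameter.
    """
    return {
        # The wrapper: parse $qq with synonym_edismax, multiply score by $func.
        "q": "{!boost b=$func defType=synonym_edismax v=$qq}",
        # The raw user input moves from q to qq.
        "qq": user_query,
        # Assumption: combine the two former bf functions with sum().
        "func": ("sum(product(field(category_weight),20),"
                 "product(query($titleQuery),4))"),
        # Assumption: titleQuery must reference $qq, not $q, to avoid
        # pointing back at the wrapper itself.
        "titleQuery": "{!edismax qf='title' bf='' bq='' v=$qq}",
    }


params = boost_wrapped_params('"μ-heavy chain disease"')
print("/foo?" + urlencode(params))
```

Because the boost is now multiplicative, the constants 20 and 4 will
almost certainly need retuning against real results.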




: 
: What I would like is to return zero results if there is no match for the
: querystring.  My collection is small enough that I don't care if the actual
: calculation runs on each doc (although that's wasteful) -- I just don't
: want to see results come back for zero matches to the querystring
: 
: (The /select endpoint does this of course, but my custom endpoint includes
: this "weighting" piece and therefore returns every doc in the corpus
: because they all have the weighting.
: 
: ====================
: Enter my imagined solution...  The potential X-Y problem...
: ====================
: 
: So - given that I come from a programming background, I immediately start
: thinking of an if statement ...
: 
:      if(some_score_for_the_primary_search_string) {
:           run_the_category_weight_calculation;
:      } else {
:           do_NOT_run_category_weight_calc;
:      }
: 
: 
: Another way of thinking of it would be something like the "WHERE" clause in
: SQL...
: 
:  run_category_weight_calculation WHERE "searchstring" is found in the
: document, not otherwise.
: 
: I'm aware that things could be handled in the client-side of my web app,
: but if possible, I'd like the interface to SOLR to be as clean as possible,
: and massage incoming SOLR data as little as possible.
: 
: In other words, do NOT return any docs if the querystring (and any
: synonyms) match zero docs.
: 
: Here is the endpoint XML for the query.  I've highlighted the specific line
: that is causing the unintended results...
: 
: 
:  <requestHandler name="/foo" class="solr.SearchHandler">
:     <!-- default values for query parameters can be specified, these
:          will be overridden by parameters in the request
:       -->
:      <lst name="defaults">
:        <str name="echoParams">all</str>
:        <int name="rows">20</int>
:        <!-- Query settings -->
:        <str name="df">text</str>
:       <!-- <str name="df">title</str> -->
:        <str name="defType">synonym_edismax</str>>
:        <str name="synonyms">true</str>
:     <!-- The line below balances out the weighting of exact matches to the
: synonym phrase entered by the user
:          with the category_weight calculation and the titleQuery calc.
: These numbers exist in a balance and
:          if one is raised or lowered, the others (probably) need to change
: as well.  It may be better to go with decimals
:          for all of them... .4 instead of 4 and 2 instead of 20 and 2.5
: instead of 25.
:          In the end, I'm not sure it really matters, but don't change one
: without changing the others
:          unless you've tested and are sure you want the results  -->
:        <float name="synonyms.originalBoost">1.5</float>
:        <float name="synonyms.synonymBoost">1.1</float>
:        <str name="mm">75%</str>
:        <str name="q.alt">*:*</str>
:        <str name="rows">20</str>
:        <str name="fq">meta_doc_type:chapterDoc</str>
:        <str name="bq">{!synonym_edismax qf='title' synonyms='true'
: synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
: v=$q}</str>
:        <str name="fl">id category_weight title category_ss score
: contentType</str>
:        <str name="titleQuery">{!edismax qf='title' bf='' bq='' v=$q}</str>
: =====================================================
:        *<str name="bf">product(field(category_weight),20)</str>*
: =====================================================
:        <str name="bf">product(query($titleQuery),4)</str>
:        <str name="qf">text contentType^1000</str>
:        <str name="wt">python</str>
:        <str name="debug">true</str>
:        <str name="debug.explain.structured">true</str>
:        <str name="indent">true</str>
:        <str name="echoParams">all</str>
:      </lst>
:   </requestHandler>
: 
: And here is the debug output for a query.  (This was a test for synonyms,
: which you'll see in the output.) The original query string was, of
: course, "μ-heavy
: chain disease"
: 
: You'll note that although there is no score in the first doc explain for
: the actual querystring, the highlighted section does get a score for
: product(double(category_weight)=1.5,const(20))
: 
: ... which is the thing that is currently causing all the docs in the
: collection to "match" even though the querystring is not in any of them.
: 
: "debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
: "querystring":"\"μ-heavy
: chain disease\"", "parsedquery":"(DisjunctionMaxQuery((text:\"μ heavy chain
: disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
: ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" | (contentType:\"mu
: heavy chain disease\")^1000.0)))/no_coord^1.1)
: ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
: hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ heavy chain
: disease\" | (contentType:\"μ heavy chain disease\")^1000.0)))/no_coord^1.1)
: ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
: hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ heavy
: chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy chain
: disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
: hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy chain
: disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
: hcd\")))/no_coord^1.1)))
: FunctionQuery(product(double(category_weight),const(20)))
: FunctionQuery(product(query(+(title:\"μ heavy chain
: disease\"),def=0.0),const(4)))", "parsedquery_toString":"(((text:\"μ heavy
: chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
: ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain
: disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
: hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" | (contentType:\"μ
: heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
: hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
: ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)
: ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)))
: product(double(category_weight),const(20)) product(query(+(title:\"μ heavy
: chain disease\"),def=0.0),const(4))", "explain":{ "
: 33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true, "value":30.0, "
: description":"sum of:", "details":[{ "match":true, "value":30.0, "
: description":"FunctionQuery(product(double(category_weight),const(20))),
: product of:",
: =====================================================
: *"details":**[{ "match":true, "value":30.0,
: "description":"product(double(category_weight)=1.5,const(20))"}, {*
: =====================================================
: 
: "match":true, "value":1.0, "description":"boost"}, { "match":true, "value":
: 1.0, "description":"queryNorm"}]}, {
: 

-Hoss
http://www.lucidworks.com/