Posted to dev@lucene.apache.org by "Robert Muir (Moved) (JIRA)" <ji...@apache.org> on 2012/02/23 23:43:49 UTC

[jira] [Moved] (LUCENE-3821) search slop problem introduced somewhere between Solr 1.4 and Solr 3.5

     [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir moved SOLR-3158 to LUCENE-3821:
-------------------------------------------

          Component/s:     (was: search)
        Lucene Fields: New
    Affects Version/s:     (was: 3.5)
                       4.0
                       3.5
                  Key: LUCENE-3821  (was: SOLR-3158)
              Project: Lucene - Java  (was: Solr)
    
> search slop problem introduced somewhere between Solr 1.4 and Solr 3.5
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>         Attachments: schema.xml, solrconfig-test.xml
>
>
> In upgrading from Solr 1.4 to Solr 3.5, the following phrase searches stopped working in dismax:
>   "The Beatles as musicians : Revolver through the Anthology"
>   "Color-blindness [print/digital]; its dangers and its detection"
> Both of these queries contain a repeated word and have many terms.  Neither the number of terms nor the colon surrounded by spaces is the cause, because the following phrase search works in Solr 3.5 (and Solr 1.4):
>     "International encyclopedia of revolution and protest : 1500 to the present"
> With Robert Muir's help, we have narrowed the problem down to slop (proximity in lucene QueryParser, query slop in dismax).  I have included debugQuery details for the Beatles search; I confirmed the same behavior with the color-blindness search.
> Solr 3.5:   it fails when the (query) slop setting isn't 0.
> ----
> lucene QueryParser with proximity set to 1 (or anything > 0) :  no match
>   URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
>   final query:  all_search:"the beatl as musician revolv through the antholog"~1
> lucene QueryParser with proximity set to 0:    result!
>   URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
>   final query:  all_search:"the beatl as musician revolv through the antholog"
>   6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
>      <snip>
>       48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
>      <snip>
> dismax QueryParser with qs=1:  no match
>       ps=0
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
>   final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       ps=1
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
>   final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
> dismax QueryParser with qs=0:    result!
>      ps=0
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
>   final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       ps=1
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
>   final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
>   8.564867 = (MATCH) sum of:
>     4.2824335 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
>         <snip>
>         48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
>         <snip>
> Solr 1.4:    it works regardless of slop settings
> ----
> lucene QueryParser with any proximity value:    result!
>       ~0
>   URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
>   final query:  all_search:"the beatl as musician revolv through the antholog"
>       ~1
>   URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
>   final query:  all_search:"the beatl as musician revolv through the antholog"~1
>   5.2672544 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
>      <snip>
>     48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
>      <snip>
> dismax QueryParser with any qs:    result!
>       qs=0, ps=0
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
>    final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       qs=0, ps=1
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
>    final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
> dismax QueryParser with qs=1:    result!
>       qs=1, ps=0
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
>    final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       qs=1, ps=1
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
>    final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
>   7.4490223 = (MATCH) sum of:
>   3.7245111 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of:
>         <snip>
>       48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
>         <snip>
> More information:
> schema.xml:
>   <field name="all_search" type="text" indexed="true" stored="false" />
> solr 3.5:
>       <fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory" />
>         <filter class="solr.ICUFoldingFilterFactory"/>  
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
>           splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
>           catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>       </analyzer>
>     </fieldtype>
> solr1.4:
> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory" />
>         <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true" />
>         <filter class="solr.WordDelimiterFilterFactory" 
>           splitOnCaseChange="1" generateWordParts="1" catenateWords="1" 
>           splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" 
>           catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
>         <filter class="solr.LowerCaseFilterFactory" />
>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>       </analyzer>
>     </fieldtype>
> The analysis page shows the same results for Solr 3.5 and 1.4:
> Solr 3.5:
> position 	1	2	3	4	5	6	7	8
> term text 	the	beatl	as	musician	revolv	through	the	antholog
> keyword 	false	false	false	false	false	false	false	false
> startOffset 	0	4	12	15	27	36	44	48
> endOffset 	3	11	14	24	35	43	47	57
> type 	word	word	word	word	word	word	word	word
> Solr 1.4:
> term position 	1	2	3	4	5	6	7	8
> term text 	the	beatl	as	musician	revolv	through	the	antholog
> term type 	word	word	word	word	word	word	word	word
> source start,end 	0,3	4,11	12,14	15,24	27,35	36,43	44,47	48,57
> For debug purposes, we can consider the Solr document as:
> <doc>
>   <str name="all_search">The Beatles as musicians : Revolver through the Anthology</str>
> </doc>
> I can't attach the full SolrDoc, as all_search is indexed but not stored, and I use SolrJ to write to the index from Java objects ... plus our objects have a zillion fields (I work in a library with very rich metadata and very exacting Solr fields).  I have attached the Solr 3.5 schema and solrconfig, but they are big and ugly for the same reasons.
> For more details, see the erroneously titled email thread "result present in Solr 1.4 but missing in Solr 3.5, dismax only"  started on 2012-02-22 on solr-user@lucene.apache.org.
> - Naomi
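
The dismax slop matrix in the description (qs and ps each set to 0 and 1) could be regenerated with a short script. This is only a sketch and not part of the original report: the base URL, the defType=dismax and debugQuery=true parameters, and the helper names dismax_params and slop_matrix are assumptions added for illustration. Only qf, pf, q, qs and ps are taken from the URLs quoted above.

```python
from itertools import product
from urllib.parse import urlencode

# Phrase taken from the report; the surrounding double quotes make it a phrase query.
PHRASE = '"The Beatles as musicians : Revolver through the Anthology"'


def dismax_params(phrase, qs, ps, field="all_search"):
    """Build request parameters for one cell of the qs/ps matrix.

    defType and debugQuery are assumptions; the report only shows
    qf, pf, q, qs and ps.
    """
    return {
        "defType": "dismax",
        "qf": field,
        "pf": field,
        "q": phrase,
        "qs": qs,
        "ps": ps,
        "debugQuery": "true",
    }


def slop_matrix(phrase, slops=(0, 1)):
    """Return one URL-encoded query string per (qs, ps) combination."""
    return [
        urlencode(dismax_params(phrase, qs, ps))
        for qs, ps in product(slops, slops)
    ]


if __name__ == "__main__":
    base = "http://localhost:8983/solr/select"  # hypothetical endpoint
    for query_string in slop_matrix(PHRASE):
        print(f"{base}?{query_string}")
```

Running the same four requests against a Solr 1.4 and a Solr 3.5 instance and comparing the hit counts would reproduce the comparison described in the report.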

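The report observes that both failing phrases contain a repeated word. A small check like the following could flag such phrases from their analyzed tokens. It is a sketch under stated assumptions: the function name repeated_terms is hypothetical, the Beatles token list is copied from the analysis output above, and the second list is only a crude lowercase whitespace split of the working phrase, not real analyzer output.

```python
from collections import defaultdict


def repeated_terms(tokens):
    """Map each term that occurs more than once to its 1-based positions."""
    positions = defaultdict(list)
    for pos, term in enumerate(tokens, start=1):
        positions[term].append(pos)
    return {term: where for term, where in positions.items() if len(where) > 1}


# Analyzed terms as shown on the analysis page in the report.
beatles = "the beatl as musician revolv through the antholog".split()

# Approximation only: lowercase whitespace split of the phrase that still works.
working = "international encyclopedia of revolution and protest : 1500 to the present".split()

if __name__ == "__main__":
    print(repeated_terms(beatles))  # {'the': [1, 7]}
    print(repeated_terms(working))  # {}
```

This only identifies phrases that share the repeated-term trait; it does not show that repetition is the cause of the sloppy phrase failure.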