You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Naomi Dushay (Created) (JIRA)" <ji...@apache.org> on 2012/02/23 23:23:50 UTC

[jira] [Created] (SOLR-3158) search slop problem introduced somewhere between Solr 1.4 and Solr 3.5

search slop problem introduced somewhere between Solr 1.4 and Solr 3.5
----------------------------------------------------------------------

                 Key: SOLR-3158
                 URL: https://issues.apache.org/jira/browse/SOLR-3158
             Project: Solr
          Issue Type: Bug
          Components: search
    Affects Versions: 3.5
            Reporter: Naomi Dushay


In upgrading from Solr 1.4 to Solr 3.5, the following phrase searches stopped working in dismax:
  "The Beatles as musicians : Revolver through the Anthology"
  "Color-blindness [print/digital]; its dangers and its detection"
Both of these queries have a repeated work, and have many terms.  It's not the number of terms or the colon surrounded by spaces, because the following phrase search works in Solr 3.5 (and Solr 1.4):
    "International encyclopedia of revolution and protest : 1500 to the present"

With Robert Muir's help, we have narrowed the problem down to slop  (proximity in lucene QueryParser, query slop in dismax).   I have included debugQuery details for  the Beatles search;  I confirmed the same behavior with the color-blindness search.


Solr 3.5:   it fails when (query) slop setting isn't 0.
----
lucene QueryParser with proximity set to 1 (or anything > 0) :  no match
  URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
  final query:  all_search:"the beatl as musician revolv through the antholog"~1

lucene QueryParser with proximity set to 0:    result!
  URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
  final query:  all_search:"the beatl as musician revolv through the antholog"

  6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
     <snip>
      48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
     <snip>

dismax QueryParser with qs=1:  no match
      ps=0
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
  final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
      ps=1
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
  final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

dismax QueryParser with qs=0:    result!
     ps=0
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
  final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
      ps=1
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
  final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

  8.564867 = (MATCH) sum of:
    4.2824335 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
        <snip>
        48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
        <snip>


Solr 1.4:    it works regardless of slop settings
----
lucene QueryParser with any proximity value:    result!
      ~0
  URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
  final query:  all_search:"the beatl as musician revolv through the antholog"
      ~1
  URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
  final query:  all_search:"the beatl as musician revolv through the antholog"~1

  5.2672544 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
     <snip>
    48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
     <snip>

dismax QueryParser with any qs:    result!
      qs=0, ps=0
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
   final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
      qs=0, ps=1
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
   final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
dismax QueryParser with qs=0:    result!
      qs=1, ps=0
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
   final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
      qs=1, ps=1
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
   final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

  7.4490223 = (MATCH) sum of:
  3.7245111 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of:
        <snip>
      48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
        <snip>


More information:

schema.xml:
  <field name="all_search" type="text" indexed="true" stored="false" />

solr 3.5:
      <fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.ICUFoldingFilterFactory"/>  
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
          splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
          catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldtype>

solr1.4:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true" />
        <filter class="solr.WordDelimiterFilterFactory" 
          splitOnCaseChange="1" generateWordParts="1" catenateWords="1" 
          splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" 
          catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldtype>


And the analysis page shows the same results for Solr 3.5 and 1.4


Solr 3.5:

position 	1	2	3	4	5	6	7	8
term text 	the	beatl	as	musician	revolv	through	the	antholog
keyword 	false	false	false	false	false	false	false	false
startOffset 	0	4	12	15	27	36	44	48
endOffset 	3	11	14	24	35	43	47	57
type 	word	word	word	word	word	word	word	word

Solr 1.4:

term position 	1	2	3	4	5	6	7	8
term text 	the	beatl	as	musician	revolv	through	the	antholog
term type 	word	word	word	word	word	word	word	word
source start,end 	0,3	4,11	12,14	15,24	27,35	36,43	44,47	48,57


For debug purposes, we can consider the Solr document as:

<doc>
  <str name="all_search">The Beatles as musicians : Revolver through the Anthology</str>
</doc>

I can't attached the full SolrDoc as all_search is indexed, but not stored, and I use SolrJ to write to the index from java objects ... plus our objects have a zillion fields (I work in a library with very rich metadata and very exacting solr fields).  I have attached the Solr 3.5 schema and solrconfig, but they are big and ugly for the same reasons.

For more details, see the erroneously titled email thread "result present in Solr 1.4 but missing in Solr 3.5, dismax only"  started on 2012-02-22 on solr-user@lucene.apache.org.

- Naomi







--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226873#comment-13226873 ] 

Robert Muir commented on LUCENE-3821:
-------------------------------------

{quote}
Robert, this gave me an idea... currently, in case of "collision" between repeaters, we compare them and advance the "lesser" of them (SloppyPhraseScorer.lesser(PhrasePositions, PhrasePositions)) - it should be fairly easy to add lookahead to this logic: if one of the two is multi-term, lesser can also do a lookahead. The amount of lookahead can depend on the slop. I'll give it a try before closing this issue.
{quote}

Interesting... its hard to think about for me since the edit distance is a little different, but at least in the
levAutomata case the maximum 'context' the thing ever needs is {{2n+1}}, where n is the distance/slop. I don't 
know if it applies here... but seems like it should be at least an upperbound.

Speaking of which on a related note, I think its possible we can implement a more... exhaustive test for 
SloppyPhraseScorer (rather than relying so much on a random one). The idea would be a twist on 
TestLevenshteinAutomata.assertCharVectors: using an alphabet of terms={0,1} the idea is to test all possible
'automaton structures', for sloppyphrasescorer, the idea would be we have the minimal test method that
tests all the cases...

I'll think on this one...


                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3821:
--------------------------------

    Attachment: LUCENE-3821.patch

updated patch with fixed MFQ.toString(), which prints the problematic doc and queries in case of failure.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3821:
--------------------------------

    Attachment: LUCENE-3821.patch

Patch with fix for this problem. I would expect SloppyPhrase scoring performance to degrade, though I did not measure it.

The single test that still fails (and I think the bug is in ExactPhraseScorer) is testRandomIncreasingSloppiness, and can be recreated like this:
{noformat}
ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtestmethod=testRandomIncreasingSloppiness -Dtests.seed=47267613db69f714:-617bb800c4a3c645:-456a673444fdc184 -Dargs="-Dfile.encoding=UTF-8"
{noformat}
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215191#comment-13215191 ] 

Robert Muir commented on LUCENE-3821:
-------------------------------------

I reviewed the random failures: in all cases it fails, repeated terms are in the query.

                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>         Attachments: LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221867#comment-13221867 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

Update: apparently MultiPhraseQuery.toString does not print its "holes".

So the query that failed was not:
{noformat}field:"(j o s) (i b j) (t d)"{noformat}

But rather:
{noformat}"(j o s) ? (i b j) ? ? (t d)"{noformat}

Which is a different story: this query should match the document
{noformat}s o b h j t j z o{noformat}

There is a match for ExactPhraseScorer, but not for Sloppy with slope 1.
So there is still work to do on SloppyPhraseScorer...

(I'll fix MFQ.toString() as well)
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Naomi Dushay (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228118#comment-13228118 ] 

Naomi Dushay commented on LUCENE-3821:
--------------------------------------

The commits from March 10 fix my two failing tests - huzzah!  Thank you so much!   - Naomi
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222958#comment-13222958 ] 

Robert Muir commented on LUCENE-3821:
-------------------------------------

{quote}
I would love it to be the 3rd if I just knew how to do it. Otherwise I like the 2nd, just need to keep in mind that the random test might from time to time create this scenario and so there will be noise in the test builds.
{quote}

I think there is no problem in fixing "some of the bugs" to improve the behavior, even if its still not perfect.

we can take our time thinking of how to handle the remaining scenarios... either way I think we should just go
with your judgement call on this one, since you obviously understand it better than anyone else.

                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (SOLR-3158) search slop problem introduced somewhere between Solr 1.4 and Solr 3.5

Posted by "Naomi Dushay (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Naomi Dushay updated SOLR-3158:
-------------------------------

    Attachment: solrconfig-test.xml
                schema.xml
    
> search slop problem introduced somewhere between Solr 1.4 and Solr 3.5
> ----------------------------------------------------------------------
>
>                 Key: SOLR-3158
>                 URL: https://issues.apache.org/jira/browse/SOLR-3158
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 3.5
>            Reporter: Naomi Dushay
>         Attachments: schema.xml, solrconfig-test.xml
>
>
> In upgrading from Solr 1.4 to Solr 3.5, the following phrase searches stopped working in dismax:
>   "The Beatles as musicians : Revolver through the Anthology"
>   "Color-blindness [print/digital]; its dangers and its detection"
> Both of these queries have a repeated work, and have many terms.  It's not the number of terms or the colon surrounded by spaces, because the following phrase search works in Solr 3.5 (and Solr 1.4):
>     "International encyclopedia of revolution and protest : 1500 to the present"
> With Robert Muir's help, we have narrowed the problem down to slop  (proximity in lucene QueryParser, query slop in dismax).   I have included debugQuery details for  the Beatles search;  I confirmed the same behavior with the color-blindness search.
> Solr 3.5:   it fails when (query) slop setting isn't 0.
> ----
> lucene QueryParser with proximity set to 1 (or anything > 0) :  no match
>   URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
>   final query:  all_search:"the beatl as musician revolv through the antholog"~1
> lucene QueryParser with proximity set to 0:    result!
>   URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
>   final query:  all_search:"the beatl as musician revolv through the antholog"
>   6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
>      <snip>
>       48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
>      <snip>
> dismax QueryParser with qs=1:  no match
>       ps=0
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
>   final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       ps=1
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
>   final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
> dismax QueryParser with qs=0:    result!
>      ps=0
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
>   final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       ps=1
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
>   final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
>   8.564867 = (MATCH) sum of:
>     4.2824335 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
>         <snip>
>         48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
>         <snip>
> Solr 1.4:    it works regardless of slop settings
> ----
> lucene QueryParser with any proximity value:    result!
>       ~0
>   URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
>   final query:  all_search:"the beatl as musician revolv through the antholog"
>       ~1
>   URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
>   final query:  all_search:"the beatl as musician revolv through the antholog"~1
>   5.2672544 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
>      <snip>
>     48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
>      <snip>
> dismax QueryParser with any qs:    result!
>       qs=0, ps=0
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
>    final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       qs=0, ps=1
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
>    final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
> dismax QueryParser with qs=0:    result!
>       qs=1, ps=0
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
>    final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       qs=1, ps=1
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
>    final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
>   7.4490223 = (MATCH) sum of:
>   3.7245111 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of:
>         <snip>
>       48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
>         <snip>
> More information:
> schema.xml:
>   <field name="all_search" type="text" indexed="true" stored="false" />
> solr 3.5:
>       <fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory" />
>         <filter class="solr.ICUFoldingFilterFactory"/>  
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
>           splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
>           catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>       </analyzer>
>     </fieldtype>
> solr1.4:
> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory" />
>         <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true" />
>         <filter class="solr.WordDelimiterFilterFactory" 
>           splitOnCaseChange="1" generateWordParts="1" catenateWords="1" 
>           splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" 
>           catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
>         <filter class="solr.LowerCaseFilterFactory" />
>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>       </analyzer>
>     </fieldtype>
> And the analysis page shows the same results for Solr 3.5 and 1.4
> Solr 3.5:
> position 	1	2	3	4	5	6	7	8
> term text 	the	beatl	as	musician	revolv	through	the	antholog
> keyword 	false	false	false	false	false	false	false	false
> startOffset 	0	4	12	15	27	36	44	48
> endOffset 	3	11	14	24	35	43	47	57
> type 	word	word	word	word	word	word	word	word
> Solr 1.4:
> term position 	1	2	3	4	5	6	7	8
> term text 	the	beatl	as	musician	revolv	through	the	antholog
> term type 	word	word	word	word	word	word	word	word
> source start,end 	0,3	4,11	12,14	15,24	27,35	36,43	44,47	48,57
> For debug purposes, we can consider the Solr document as:
> <doc>
>   <str name="all_search">The Beatles as musicians : Revolver through the Anthology</str>
> </doc>
> I can't attached the full SolrDoc as all_search is indexed, but not stored, and I use SolrJ to write to the index from java objects ... plus our objects have a zillion fields (I work in a library with very rich metadata and very exacting solr fields).  I have attached the Solr 3.5 schema and solrconfig, but they are big and ugly for the same reasons.
> For more details, see the erroneously titled email thread "result present in Solr 1.4 but missing in Solr 3.5, dismax only"  started on 2012-02-22 on solr-user@lucene.apache.org.
> - Naomi

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221879#comment-13221879 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

I think I understand the cause.

In current implementation there is an assumption that once we landed on the first candidate document, it is possible to check if there are repeating pps, by just comparing the in-doc-positions of the terms. 

Tricky as it is, while this is true for plain PhrasePositions, it is not true for MultiPhrasePositions - assume to MPPs: (a m n) and (b x y), and first candidate document that starts with "a b". The in-doc-positions of the two pps would be 0,1 respectively (for 'a' and 'b') and we would not even detect the fact that there are repetitions, not to mention not putting them in the same group.

MPPs conflicts with current patch in an additional manner: It is now assumed that each repetition can be assigned a repetition group. 

So assume these PPs (and query positions): 
0:a 1:b 3:a 4:b 7:c
There are clearly two repetition groups {0:a, 3:a} and {1:b, 4:b}, 
while 7:c is not a repetition.

But assume these PPs (and query positions): 
0:(a b) 1:(b x) 3:a 4:b 7:(c x)
We end up with a single large repetition group:
{0:(a b) 1:(b x) 3:a 4:b 7:(c x)}

I think if the groups are created correctly at the first candidate document, scorer logic would still work, as a collision is decided only when two pps are in the same in-doc-position. The only impact of MPPs would be performance cost: since repetition groups are larger, it would take longer to check if there are repetitions.

Just need to figure out how to detect repetition groups without relying on in-(first-)doc-positions.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223629#comment-13223629 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

bq. sounds interesting: ExactPhraseScorer really has a lot of useful recent heuristics and optimizations, especially about when to next() versus advance() and such?

next()/advance() will remain, but it would still be more costly than exact - score cache won't play, because freqs really are float in this case, and also there would be more computations on the way. But let's see it working first...
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223560#comment-13223560 ] 

Robert Muir commented on LUCENE-3821:
-------------------------------------

sounds interesting: ExactPhraseScorer really has a lot of useful recent heuristics and optimizations, 
especially about when to next() versus advance() and such?

net/net this idea could possibly improvement the performance overall of SloppyPhraseScorer
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219123#comment-13219123 ] 

Michael McCandless commented on LUCENE-3821:
--------------------------------------------

I just @Ignore'd this test... it's creating a lot of Jenkins noise... but we should fix this bug!!
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>         Attachments: LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Assigned] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen reassigned LUCENE-3821:
-----------------------------------

    Assignee: Doron Cohen
    
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223856#comment-13223856 ] 

Robert Muir commented on LUCENE-3821:
-------------------------------------

{quote}
To prevent two repeating terms from landing on the same document position, we propagate the smaller of them (smaller in its phrase-position, which takes into account both the doc-position and the offset of that term in the query).

Without this special treatment, a phrase query "a b a"~2 might match a document "a b", because both "a"'s (query terms) will land on the same document's "a". This is illegal and is prevented by such propagation.

But when one of the repeating terms is a multi-term, it is not possible to know which of the repeating terms to propagate. This is the unsolved bug.
{quote}

Not understanding really how SloppyPhraseScorer works now, but not trying to add confusion to the issue, I can't help but think
this problem is a variant on LevensteinAutomata... in fact that was the motivation for the new test, i just stole the testing
methodology from there and applied it to this!

It seems many things are the same but with a few twists:
* fundamentally we are interleaving the streams from the subscorers into the 'index automaton'
* 'query automaton' is produced from the user-supplied terms
* our 'alphabet' is the terms, and holes from position increment are just an additional symbol.
* just like the LevensteinAutomata case, repeats are problematic because they are different characteristic vectors
* stacked terms at the same position (index or query) just make the automata more complex (so they arent just strings)

I'm not suggesting we try to re-use any of that code at all, i don't think it will work. But I wonder if we can re-use even
some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...

                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) search slop problem introduced somewhere between Solr 1.4 and Solr 3.5

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215154#comment-13215154 ] 

Robert Muir commented on LUCENE-3821:
-------------------------------------

Here's Naomi's original description (moved from description section): including the queries, however I can't 
reproduce with just that phrase of the document (I tried, it passes).

I think its a more complex bug in SloppyPhraseScorer... and I can reproduce similar behavior with another test.

{noformat}
In upgrading from Solr 1.4 to Solr 3.5, the following phrase searches stopped working in dismax:
"The Beatles as musicians : Revolver through the Anthology"
"Color-blindness [print/digital]; its dangers and its detection"
Both of these queries have a repeated work, and have many terms. It's not the number of terms or the colon surrounded by spaces, because the following phrase search works in Solr 3.5 (and Solr 1.4):
"International encyclopedia of revolution and protest : 1500 to the present"

With Robert Muir's help, we have narrowed the problem down to slop (proximity in lucene QueryParser, query slop in dismax). I have included debugQuery details for the Beatles search; I confirmed the same behavior with the color-blindness search.

Solr 3.5: it fails when (query) slop setting isn't 0.

lucene QueryParser with proximity set to 1 (or anything > 0) : no match
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
final query: all_search:"the beatl as musician revolv through the antholog"~1

lucene QueryParser with proximity set to 0: result!
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"

6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
<snip>
48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
<snip>

dismax QueryParser with qs=1: no match
ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

dismax QueryParser with qs=0: result!
ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

8.564867 = (MATCH) sum of:
4.2824335 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
<snip>
48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
<snip>

Solr 1.4: it works regardless of slop settings

lucene QueryParser with any proximity value: result!
~0
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"
~1
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
final query: all_search:"the beatl as musician revolv through the antholog"~1

5.2672544 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
<snip>
48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
<snip>

dismax QueryParser with any qs: result!
qs=0, ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
qs=0, ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
dismax QueryParser with qs=0: result!
qs=1, ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
qs=1, ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

7.4490223 = (MATCH) sum of:
3.7245111 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of:
<snip>
48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
<snip>

More information:

schema.xml:
<field name="all_search" type="text" indexed="true" stored="false" />

solr 3.5:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldtype>

solr1.4:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true" />
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldtype>

And the analysis page shows the same results for Solr 3.5 and 1.4

Solr 3.5:

position 1 2 3 4 5 6 7 8
term text the beatl as musician revolv through the antholog
keyword false false false false false false false false
startOffset 0 4 12 15 27 36 44 48
endOffset 3 11 14 24 35 43 47 57
type word word word word word word word word

Solr 1.4:

term position 1 2 3 4 5 6 7 8
term text the beatl as musician revolv through the antholog
term type word word word word word word word word
source start,end 0,3 4,11 12,14 15,24 27,35 36,43 44,47 48,57

For debug purposes, we can consider the Solr document as:

<doc>
<str name="all_search">The Beatles as musicians : Revolver through the Anthology</str>
</doc>

I can't attached the full SolrDoc as all_search is indexed, but not stored, and I use SolrJ to write to the index from java objects ... plus our objects have a zillion fields (I work in a library with very rich metadata and very exacting solr fields). I have attached the Solr 3.5 schema and solrconfig, but they are big and ugly for the same reasons.

For more details, see the erroneously titled email thread "result present in Solr 1.4 but missing in Solr 3.5, dismax only" started on 2012-02-22 on solr-user@lucene.apache.org.

    Naomi

{noformat}
                
> search slop problem introduced somewhere between Solr 1.4 and Solr 3.5
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>         Attachments: schema.xml, solrconfig-test.xml
>
>
> In upgrading from Solr 1.4 to Solr 3.5, the following phrase searches stopped working in dismax:
>   "The Beatles as musicians : Revolver through the Anthology"
>   "Color-blindness [print/digital]; its dangers and its detection"
> Both of these queries have a repeated work, and have many terms.  It's not the number of terms or the colon surrounded by spaces, because the following phrase search works in Solr 3.5 (and Solr 1.4):
>     "International encyclopedia of revolution and protest : 1500 to the present"
> With Robert Muir's help, we have narrowed the problem down to slop  (proximity in lucene QueryParser, query slop in dismax).   I have included debugQuery details for  the Beatles search;  I confirmed the same behavior with the color-blindness search.
> Solr 3.5:   it fails when (query) slop setting isn't 0.
> ----
> lucene QueryParser with proximity set to 1 (or anything > 0) :  no match
>   URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
>   final query:  all_search:"the beatl as musician revolv through the antholog"~1
> lucene QueryParser with proximity set to 0:    result!
>   URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
>   final query:  all_search:"the beatl as musician revolv through the antholog"
>   6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
>      <snip>
>       48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
>      <snip>
> dismax QueryParser with qs=1:  no match
>       ps=0
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
>   final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       ps=1
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
>   final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
> dismax QueryParser with qs=0:    result!
>      ps=0
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
>   final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       ps=1
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
>   final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
>   8.564867 = (MATCH) sum of:
>     4.2824335 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
>         <snip>
>         48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
>         <snip>
> Solr 1.4:    it works regardless of slop settings
> ----
> lucene QueryParser with any proximity value:    result!
>       ~0
>   URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
>   final query:  all_search:"the beatl as musician revolv through the antholog"
>       ~1
>   URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
>   final query:  all_search:"the beatl as musician revolv through the antholog"~1
>   5.2672544 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
>      <snip>
>     48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
>      <snip>
> dismax QueryParser with any qs:    result!
>       qs=0, ps=0
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
>    final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       qs=0, ps=1
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
>    final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
> dismax QueryParser with qs=0:    result!
>       qs=1, ps=0
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
>    final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       qs=1, ps=1
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
>    final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
>   7.4490223 = (MATCH) sum of:
>   3.7245111 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of:
>         <snip>
>       48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
>         <snip>
> More information:
> schema.xml:
>   <field name="all_search" type="text" indexed="true" stored="false" />
> solr 3.5:
>       <fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory" />
>         <filter class="solr.ICUFoldingFilterFactory"/>  
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
>           splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
>           catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>       </analyzer>
>     </fieldtype>
> solr1.4:
> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory" />
>         <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true" />
>         <filter class="solr.WordDelimiterFilterFactory" 
>           splitOnCaseChange="1" generateWordParts="1" catenateWords="1" 
>           splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" 
>           catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
>         <filter class="solr.LowerCaseFilterFactory" />
>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>       </analyzer>
>     </fieldtype>
> And the analysis page shows the same results for Solr 3.5 and 1.4
> Solr 3.5:
> position 	1	2	3	4	5	6	7	8
> term text 	the	beatl	as	musician	revolv	through	the	antholog
> keyword 	false	false	false	false	false	false	false	false
> startOffset 	0	4	12	15	27	36	44	48
> endOffset 	3	11	14	24	35	43	47	57
> type 	word	word	word	word	word	word	word	word
> Solr 1.4:
> term position 	1	2	3	4	5	6	7	8
> term text 	the	beatl	as	musician	revolv	through	the	antholog
> term type 	word	word	word	word	word	word	word	word
> source start,end 	0,3	4,11	12,14	15,24	27,35	36,43	44,47	48,57
> For debug purposes, we can consider the Solr document as:
> <doc>
>   <str name="all_search">The Beatles as musicians : Revolver through the Anthology</str>
> </doc>
> I can't attached the full SolrDoc as all_search is indexed, but not stored, and I use SolrJ to write to the index from java objects ... plus our objects have a zillion fields (I work in a library with very rich metadata and very exacting solr fields).  I have attached the Solr 3.5 schema and solrconfig, but they are big and ugly for the same reasons.
> For more details, see the erroneously titled email thread "result present in Solr 1.4 but missing in Solr 3.5, dismax only"  started on 2012-02-22 on solr-user@lucene.apache.org.
> - Naomi

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221746#comment-13221746 ] 

Michael McCandless commented on LUCENE-3821:
--------------------------------------------

Doron do you have the seed for that failure?  I can dig on the ExactPhraseScorer...
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221840#comment-13221840 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

The remaining failure still happens with the updated patch (same seed), and still seems to me like an ExactPhraseScorer bug. 

However, it is probably not a simple one I think, because when adding to TestMultiPhraseQuery, it passes, that is, no documents are matched, as expected, although this supposedly created the exact scenario that failed above. 

Perhaps ExactPhraseScorer behavior too is influenced by other docs processed so far.

{code:title=Add this to TestMultiPhraseQuery}
  public void test_LUCENE_XYZ() throws Exception {
    Directory indexStore = newDirectory();
    RandomIndexWriter writer = new RandomIndexWriter(random, indexStore);
    add("s o b h j t j z o", "LUCENE-XYZ", writer);
    
    IndexReader reader = writer.getReader();
    IndexSearcher searcher = newSearcher(reader);
    
    MultiPhraseQuery q = new MultiPhraseQuery();
    q.add(new Term[] {new Term("body", "j"), new Term("body", "o"), new Term("body", "s")});
    q.add(new Term[] {new Term("body", "i"), new Term("body", "b"), new Term("body", "j")});
    q.add(new Term[] {new Term("body", "t"), new Term("body", "d")});
    assertEquals("Wrong number of hits", 0,
        searcher.search(q, null, 1).totalHits);
    
    // just make sure no exc:
    searcher.explain(q, 0);
    
    writer.close();
    searcher.close();
    reader.close();
    indexStore.close();
  }
{code}
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219156#comment-13219156 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

Fails here too like this: 

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtestmethod=testRandomIncreasingSloppiness -Dtests.seed=-171bbb992c697625:203709d611c854a5:1ca48cb6d33b3f74 -Dargs="-Dfile.encoding=UTF-8"

I'll look into it
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>         Attachments: LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226623#comment-13226623 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

Committed:
- r1299077  3x
- r1299112  trunk

bq. I would be glad to try out a nightly build with the patch as is against our tests - I would be glad to get the 80% solution if it fixes my bug.

It's in now...

bq.  But I wonder if we can re-use even some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...

Robert, this gave me an idea... currently, in case of "collision" between repeaters, we compare them and advance the "lesser" of them (SloppyPhraseScorer.lesser(PhrasePositions, PhrasePositions)) - it should be fairly easy to add lookahead to this logic: if one of the two is multi-term, lesser can also do a lookahead. The amount of lookahead can depend on the slop. I'll give it a try before closing this issue.

                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223700#comment-13223700 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

I'm afraid it won't solve the problem.

The complicity of SloppyPhraseScorer stems firstly from the slop.
That part is handled in the scorer for long time.

Two additional complications are repeating terms, and multi-term phrases.
Each one of these, separately, is handled as well.
Their combination however, is the cause for this discussion.

To prevent two repeating terms from landing on the same document position, we propagate the smaller of them (smaller in its phrase-position, which takes into account both the doc-position and the offset of that term in the query).

Without this special treatment, a phrase query "a b a"~2 might match a document "a b", because both "a"'s (query terms) will land on the same document's "a". This is illegal and is prevented by such propagation. 

But when one of the repeating terms is a multi-term, it is not possible to know which of the repeating terms to propagate. This is the unsolved bug.

Now, back to current ExactPhraseScorer.
It does not have this problem with repeating terms.
But not because of the different algorithm - rather because of the different scenario.
It does not have this problem because exact phrase scoring does not have it.
In exact phrase scoring, a match is declared only when all PPs are in the same phrase position.
Recall that phrase position = doc-position - query-offset, it is visible that when two PPs with different query offset are in the same phrase-position, their doc-position cannot be the same, and therefore no special handling is needed for repeating terms in exact phrase scorers.

However, once we will add that slopy-decaying frequency, we will match in certain posIndex, different phrase-positions. This is because of the slop. So they might land on the same doc-position, and then we start again...

This is really too bad. Sorry for the lengthy post, hopefully this would help when someone wants to get into this.

Back to option 2.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221737#comment-13221737 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

I understand the problem. 

It all has to do - as Robert mentioned - with the repeating terms in the phrase query. I am working on a patch - it will change the way that repeats are handled. 

Repeating PPs require additional computation - and current SloppyPhraseScorer attempts to do that additional work efficiently, but over simplifies in that and fail to cover all cases. 

In the core of things, each time a repeating PP is selected (from the queue) and  propagated, *all* its sibling repeaters are propagated as well, to prevent a case that two repeating PPs point to the same document position (which was the bug that originally triggered handling repeats in this code). 

But this is wrong, because it propagates all siblings repeaters, and misses some cases.

Also, the code is hard to read, as Mike noted in LUCENE-2410 ([this comment|https://issues.apache.org/jira/browse/LUCENE-2410?focusedCommentId=12879443&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12879443]) ).

So this is a chance to also make the code more maintainable.

I have a working version which is not ready to commit yet, and all the tests pass, except for one which I think is a bug in ExactPhraseScorer, but maybe i am missing something. 

The case that fails is this:

{noformat}
AssertionError: Missing in super-set: doc 706
q1: field:"(j o s) (i b j) (t d)"
q2: field:"(j o s) (i b j) (t d)"~1
td1: [doc=706 score=7.7783184 shardIndex=-1, doc=175 score=6.222655 shardIndex=-1]
td2: [doc=523 score=5.5001016 shardIndex=-1, doc=957 score=5.5001016 shardIndex=-1, doc=228 score=4.400081 shardIndex=-1, doc=357 score=4.400081 shardIndex=-1, doc=390 score=4.400081 shardIndex=-1, doc=503 score=4.400081 shardIndex=-1, doc=602 score=4.400081 shardIndex=-1, doc=757 score=4.400081 shardIndex=-1, doc=758 score=4.400081 shardIndex=-1]
doc 706: Document<stored,indexed,tokenized<field:s o b h j t j z o>>
{noformat}

It seems that q1 too should not match this document?
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3821:
--------------------------------

    Attachment: LUCENE-3821_test.patch

Here's an improved test that fails every time.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>         Attachments: LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223554#comment-13223554 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

OK great! 

If you did not point a problem with this up front there's a good chance it will work and I'd like to give it a try. 

I have a first patch - not working or anything - it opens ExactPhraseScorer a bit for extensions and adds a class (temporary name) - NonExactPhraseScorer. 

The idea is to hide in the ChunkState the details of decaying frequencies due to slops. I will try it over the weekend. If we can make it this way, I'd rather do it in this issue rather than committing the other new code for the fix and then replacing it. If that won't go quick, I'll commit the (other) changes to SloppyPhraseScorer and start a new issue.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223077#comment-13223077 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

Thanks Robert, okay, I'll continue with option 2 then.

In addition, perhaps should try harder for a sloppy version of current ExactPhraseScorer, for both performance and correctness reasons. 

In ExactPhraseScorer, the increment of count[posIndex] is by 1, each time the conditions for a match (still) holds.  

A sloppy version of this, with N terms and slop=S could increment differently:
{noformat}
1 + N*S        at posIndex
1 + N*S - 1    at posIndex-1 and posIndex+1
1 + N*S - 2 at posIndex-2 and posIndex+3
...
1 + N*S - S at posIndex-S and posIndex+S
{noformat}

For S=0, this falls back to only increment by 1 and only at posIndex, same as the ExactPhraseScorer, which makes sense.

Also, the success criteria in ExactPhraseScorer, when checking term k, is, before adding up 1 for term k:
* count[posIndex] == k-1
Or, after adding up 1 for term k:
* count[posIndex] == k

In the sloppy version the criteria (after adding up term k) would be:
* count[posIndex] >= k*(1+N*S)-S

Again, for S=0 this falls to the ExactPhraseScorer logic:
* count[posIndex] >= k  

Mike (and all), correctness wise, what do you think?

If you are wondering why the increment at the actual position is (1 + N*S) - it allows to match only posIndexes where all terms contributed something.

I drew an example with 5 terms and slop=2 and so far it seems correct.

Also tried 2 terms and slop=5, this seems correct as well, just that, when there is a match, several posIndexes will contribute to the score of the same match. I think this is not too bad, as for these parameters, same behavior would be in all documents. I would be especially forgiving for this if we this way get some of the performance benefits of the ExactPhraseScorer.

If we agree on correctness, need to understand how to implement it, and consider the performance effect. The tricky part is to increment at posIndex-n. Say there are 3 terms in the query and one of the terms is found at indexes 10, 15, 18. Assume the slope is 2. Since N=3, the max increment is:
- 1 + N*S = 1 + 3*2 = 7.

So the increments for this term would be (pos, incr):
{noformat}
Pos   Increment
---   ---------
 8  ,  5
 9  ,  6
10  ,  7
11  ,  6
12  ,  5
13  ,  5
14  ,  6
15  ,  7   =  max(7,5)
16  ,  6   =  max(6,5)
17  ,  6   =  max(5,6)
18  ,  7
19  ,  6
20  ,  5
{noformat}

So when we get to posIndex 17, we know that posIndex 15 contributes 5, but we do not know yet about the contribution of posIndex 18, which is 6, and should be used instead of 5. So some look-ahead (or some fix-back) is required, which will complicate the code.

If this seems promising, should probably open a new issue for it, just wanted to get some feedback first.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3821:
--------------------------------

    Attachment: LUCENE-3821-SloppyDecays.patch

Patch adds NonExactPhraseScorer (temporary name) as discussed above - work in progress, it does not yet do any sloppy matching or scoring.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Moved] (LUCENE-3821) search slop problem introduced somewhere between Solr 1.4 and Solr 3.5

Posted by "Robert Muir (Moved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir moved SOLR-3158 to LUCENE-3821:
-------------------------------------------

          Component/s:     (was: search)
        Lucene Fields: New
    Affects Version/s:     (was: 3.5)
                       4.0
                       3.5
                  Key: LUCENE-3821  (was: SOLR-3158)
              Project: Lucene - Java  (was: Solr)
    
> search slop problem introduced somewhere between Solr 1.4 and Solr 3.5
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>         Attachments: schema.xml, solrconfig-test.xml
>
>
> In upgrading from Solr 1.4 to Solr 3.5, the following phrase searches stopped working in dismax:
>   "The Beatles as musicians : Revolver through the Anthology"
>   "Color-blindness [print/digital]; its dangers and its detection"
> Both of these queries have a repeated work, and have many terms.  It's not the number of terms or the colon surrounded by spaces, because the following phrase search works in Solr 3.5 (and Solr 1.4):
>     "International encyclopedia of revolution and protest : 1500 to the present"
> With Robert Muir's help, we have narrowed the problem down to slop  (proximity in lucene QueryParser, query slop in dismax).   I have included debugQuery details for  the Beatles search;  I confirmed the same behavior with the color-blindness search.
> Solr 3.5:   it fails when (query) slop setting isn't 0.
> ----
> lucene QueryParser with proximity set to 1 (or anything > 0) :  no match
>   URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
>   final query:  all_search:"the beatl as musician revolv through the antholog"~1
> lucene QueryParser with proximity set to 0:    result!
>   URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
>   final query:  all_search:"the beatl as musician revolv through the antholog"
>   6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
>      <snip>
>       48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
>      <snip>
> dismax QueryParser with qs=1:  no match
>       ps=0
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
>   final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       ps=1
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
>   final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
> dismax QueryParser with qs=0:    result!
>      ps=0
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
>   final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       ps=1
>   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
>   final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
>   8.564867 = (MATCH) sum of:
>     4.2824335 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
>         <snip>
>         48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
>         <snip>
> Solr 1.4:    it works regardless of slop settings
> ----
> lucene QueryParser with any proximity value:    result!
>       ~0
>   URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
>   final query:  all_search:"the beatl as musician revolv through the antholog"
>       ~1
>   URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
>   final query:  all_search:"the beatl as musician revolv through the antholog"~1
>   5.2672544 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
>      <snip>
>     48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
>      <snip>
> dismax QueryParser with any qs:    result!
>       qs=0, ps=0
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
>    final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       qs=0, ps=1
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
>    final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
> dismax QueryParser with qs=0:    result!
>       qs=1, ps=0
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
>    final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
>       qs=1, ps=1
>    URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
>    final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
>   7.4490223 = (MATCH) sum of:
>   3.7245111 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of:
>         <snip>
>       48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
>         <snip>
> More information:
> schema.xml:
>   <field name="all_search" type="text" indexed="true" stored="false" />
> solr 3.5:
>       <fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory" />
>         <filter class="solr.ICUFoldingFilterFactory"/>  
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
>           splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
>           catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>       </analyzer>
>     </fieldtype>
> solr1.4:
> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory" />
>         <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true" />
>         <filter class="solr.WordDelimiterFilterFactory" 
>           splitOnCaseChange="1" generateWordParts="1" catenateWords="1" 
>           splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" 
>           catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
>         <filter class="solr.LowerCaseFilterFactory" />
>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>       </analyzer>
>     </fieldtype>
> And the analysis page shows the same results for Solr 3.5 and 1.4
> Solr 3.5:
> position 	1	2	3	4	5	6	7	8
> term text 	the	beatl	as	musician	revolv	through	the	antholog
> keyword 	false	false	false	false	false	false	false	false
> startOffset 	0	4	12	15	27	36	44	48
> endOffset 	3	11	14	24	35	43	47	57
> type 	word	word	word	word	word	word	word	word
> Solr 1.4:
> term position 	1	2	3	4	5	6	7	8
> term text 	the	beatl	as	musician	revolv	through	the	antholog
> term type 	word	word	word	word	word	word	word	word
> source start,end 	0,3	4,11	12,14	15,24	27,35	36,43	44,47	48,57
> For debug purposes, we can consider the Solr document as:
> <doc>
>   <str name="all_search">The Beatles as musicians : Revolver through the Anthology</str>
> </doc>
> I can't attached the full SolrDoc as all_search is indexed, but not stored, and I use SolrJ to write to the index from java objects ... plus our objects have a zillion fields (I work in a library with very rich metadata and very exacting solr fields).  I have attached the Solr 3.5 schema and solrconfig, but they are big and ugly for the same reasons.
> For more details, see the erroneously titled email thread "result present in Solr 1.4 but missing in Solr 3.5, dismax only"  started on 2012-02-22 on solr-user@lucene.apache.org.
> - Naomi

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3821:
--------------------------------

    Attachment: LUCENE-3821.patch

bq. Hmm patch has this: ... import backport.api...

Oops, here's a fixed patch, also added some comments, and removed the @Ignore from the test

bq. I'm going to be ecstatic if that crazy test finds bugs both in exact and sloppy phrase scorers :)

It is a great test! Interestingly one thing it exposed is the dependency of the SloppyPhraseScorer in the order of PPs in PhraseScorer when phraseFreq() is invoked. The way things work in the super class, that order depends on the content of previously processed documents. This fix removes that wrong dependency, of course. The point is that deliberately devising a test that exposes such a bug seems almost impossible: first you need to think about such a case, and if you did, writing a test that would create this specific scenario is buggy by itself. Praise to random testing, and this random test in particular.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Naomi Dushay (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223839#comment-13223839 ] 

Naomi Dushay commented on LUCENE-3821:
--------------------------------------

I would be glad to try out a nightly build with the patch as is against our tests - I would be glad to get the 80% solution if it fixes my bug.  I haven't compiled from source yet, though, so am inclined to wait for the patch getting posted to the nightly.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221750#comment-13221750 ] 

Michael McCandless commented on LUCENE-3821:
--------------------------------------------

Hmm patch has this:

{noformat}
import backport.api.edu.emory.mathcs.backport.java.util.Arrays;
{noformat}
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226494#comment-13226494 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

{quote}
Not understanding really how SloppyPhraseScorer works now, but not trying to add confusion to the issue, I can't help but think this problem is a variant on LevensteinAutomata... in fact that was the motivation for the new test, i just stole the testing methodology from there and applied it to this!
{quote}

Interesting! I was not aware of this. I went and read some about this automaton, It is relevant.

{quote}
It seems many things are the same but with a few twists:

* fundamentally we are interleaving the streams from the subscorers into the 'index automaton'
'query automaton' is produced from the user-supplied terms
{quote}

True. In fact, the current code works hard to decide on the "correct interleaving order" - while if we had a "Perfect Levenstein Automaton" that took care of the computation we could just interleave, in the term position order (forget about phrase position and all that) and let the automaton compute the distance. 

This might capture the difficulty in making the sloppy phrase scorer correct: it started with the algorithm that was optimized for exact matching, and adopted (hacked?) it for approximate matching.

Instead, starting with a model that fits approximate matching, might be easier and cleaner. I like that. 

{quote}
* our 'alphabet' is the terms, and holes from position increment are just an additional symbol.
* just like the LevensteinAutomata case, repeats are problematic because they are different characteristic vectors
* stacked terms at the same position (index or query) just make the automata more complex (so they arent just strings)

I'm not suggesting we try to re-use any of that code at all, i don't think it will work. But I wonder if we can re-use even
some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...
{quote}

I agree. I'll think of this.

In the meantime I'll commit this partial fix.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Doron Cohen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3821:
--------------------------------

    Attachment: LUCENE-3821.patch

Attached updated patch. 

Repeating PPs with multi-Phrase-query is handled as well.

This called for more cases in the sloppy phrase scorer and more code, and, although I think the code is cleaner now, I don't know to what extent is it easier to maintain. 

It definitely fixes wrong behavior that exists in current 3x and trunk (patch is for 3x).

However, although the random test passes for me even with -Dtests.iter=2000, it is possible to "break the scorer" - that is, create a document and a query which should match each other but would not. 

The patch adds just such a case as an @Ignored test case:  TestMultiPhraseQuery.testMultiSloppyWithRepeats(). 

I don't see how to solve this specific case in the context of current sloppy phrase scorer. 

So there are 3 options:
# leave things as they are
# commit this patch and for now document the failing scenario (also keep it in the ignored test case). 
# devise a different algorithm for this.

I would love it to be the 3rd if I just knew how to do it. Otherwise I like the 2nd, just need to keep in mind that the random test might from time to time create this scenario and so there will be noise in the test builds.

Preferences?
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223410#comment-13223410 ] 

Michael McCandless commented on LUCENE-3821:
--------------------------------------------

Cool!

I haven't fully thought this out (sloppy phrase matching is hard to think about!), but, tentatively, I think this is correct...?
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221755#comment-13221755 ] 

Robert Muir commented on LUCENE-3821:
-------------------------------------

Thanks for digging into the problem Doron!

I'm going to be ecstatic if that crazy test finds bugs both in exact and sloppy phrase scorers :)
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3821:
--------------------------------

    Description: 
The general bug is a case where a phrase with no slop is found,
but if you add slop its not.

I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
jenkins just hasn't had enough time to chew on it.

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x


  was:
In upgrading from Solr 1.4 to Solr 3.5, the following phrase searches stopped working in dismax:
  "The Beatles as musicians : Revolver through the Anthology"
  "Color-blindness [print/digital]; its dangers and its detection"
Both of these queries have a repeated work, and have many terms.  It's not the number of terms or the colon surrounded by spaces, because the following phrase search works in Solr 3.5 (and Solr 1.4):
    "International encyclopedia of revolution and protest : 1500 to the present"

With Robert Muir's help, we have narrowed the problem down to slop  (proximity in lucene QueryParser, query slop in dismax).   I have included debugQuery details for  the Beatles search;  I confirmed the same behavior with the color-blindness search.


Solr 3.5:   it fails when (query) slop setting isn't 0.
----
lucene QueryParser with proximity set to 1 (or anything > 0) :  no match
  URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
  final query:  all_search:"the beatl as musician revolv through the antholog"~1

lucene QueryParser with proximity set to 0:    result!
  URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
  final query:  all_search:"the beatl as musician revolv through the antholog"

  6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
     <snip>
      48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
     <snip>

dismax QueryParser with qs=1:  no match
      ps=0
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
  final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
      ps=1
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
  final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

dismax QueryParser with qs=0:    result!
     ps=0
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
  final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
      ps=1
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
  final query:  +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

  8.564867 = (MATCH) sum of:
    4.2824335 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
        <snip>
        48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
        <snip>


Solr 1.4:    it works regardless of slop settings
----
lucene QueryParser with any proximity value:    result!
      ~0
  URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
  final query:  all_search:"the beatl as musician revolv through the antholog"
      ~1
  URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
  final query:  all_search:"the beatl as musician revolv through the antholog"~1

  5.2672544 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
     <snip>
    48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
     <snip>

dismax QueryParser with any qs:    result!
      qs=0, ps=0
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
   final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
      qs=0, ps=1
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
   final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
dismax QueryParser with qs=0:    result!
      qs=1, ps=0
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
   final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
      qs=1, ps=1
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
   final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

  7.4490223 = (MATCH) sum of:
  3.7245111 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of:
        <snip>
      48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
        <snip>


More information:

schema.xml:
  <field name="all_search" type="text" indexed="true" stored="false" />

solr 3.5:
      <fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.ICUFoldingFilterFactory"/>  
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
          splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
          catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldtype>

solr1.4:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true" />
        <filter class="solr.WordDelimiterFilterFactory" 
          splitOnCaseChange="1" generateWordParts="1" catenateWords="1" 
          splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" 
          catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldtype>


And the analysis page shows the same results for Solr 3.5 and 1.4


Solr 3.5:

position 	1	2	3	4	5	6	7	8
term text 	the	beatl	as	musician	revolv	through	the	antholog
keyword 	false	false	false	false	false	false	false	false
startOffset 	0	4	12	15	27	36	44	48
endOffset 	3	11	14	24	35	43	47	57
type 	word	word	word	word	word	word	word	word

Solr 1.4:

term position 	1	2	3	4	5	6	7	8
term text 	the	beatl	as	musician	revolv	through	the	antholog
term type 	word	word	word	word	word	word	word	word
source start,end 	0,3	4,11	12,14	15,24	27,35	36,43	44,47	48,57


For debug purposes, we can consider the Solr document as:

<doc>
  <str name="all_search">The Beatles as musicians : Revolver through the Anthology</str>
</doc>

I can't attached the full SolrDoc as all_search is indexed, but not stored, and I use SolrJ to write to the index from java objects ... plus our objects have a zillion fields (I work in a library with very rich metadata and very exacting solr fields).  I have attached the Solr 3.5 schema and solrconfig, but they are big and ugly for the same reasons.

For more details, see the erroneously titled email thread "result present in Solr 1.4 but missing in Solr 3.5, dismax only"  started on 2012-02-22 on solr-user@lucene.apache.org.

- Naomi







        Summary: SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.  (was: search slop problem introduced somewhere between Solr 1.4 and Solr 3.5)

Moving to comments...
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>         Attachments: schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Posted by "Naomi Dushay (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215172#comment-13215172 ] 

Naomi Dushay commented on LUCENE-3821:
--------------------------------------

How's this for a document:

<doc>
<str name="id">1</str>
<str name="title_245a_display">The Beatles as musicians</str>
<str name="title_245a_search">The Beatles as musicians :</str>
<str name="title_245c_display">Walter Everett</str>
<str name="title_display">The Beatles as musicians : Revolver through the Anthology</str>
<str name="title_full_display">The Beatles as musicians : Revolver through the Anthology  / Walter Everett.</str>
<str name="title_245_search">The Beatles as musicians : Revolver through the Anthology  / Walter Everett.</str>
<str name="title_sort">Beatles as musicians Revolver through the Anthology</str>
<str name="all_search">The Beatles as musicians : Revolver through the Anthology</str>
</doc>
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>         Attachments: schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org