Posted to dev@lucene.apache.org by "Jack Krupansky (JIRA)" <ji...@apache.org> on 2013/05/23 13:58:20 UTC

[jira] [Comment Edited] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number

    [ https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658718#comment-13658718 ] 

Jack Krupansky edited comment on SOLR-4824 at 5/23/13 11:58 AM:
----------------------------------------------------------------

Lucene's FuzzyQuery has a parameter named "maxExpansions", which defaults to 50. I believe this is the largest number of candidate terms the fuzzy query will "rewrite" to, so once that many matching terms [terms, not documents] have been found, I don't think any more will be considered. Robert or one of the other Lucene experts can confirm.

At the Lucene level this can be changed with the FuzzyQuery(Term term, int maxEdits, int prefixLength, int maxExpansions, boolean transpositions) constructor, but the Solr query parser uses the FuzzyQuery(Term term, int maxEdits, int prefixLength) constructor, so there is no provision for overriding that limit of 50.
Also note that even in Lucene, maxExpansions is capped by maxBooleanClauses, which is 1024 unless you override it in solrconfig. Raising that would not do you any good, though, unless you had a query parser that let you set maxExpansions.

Still, that is a reasonable enhancement request.
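
To make the suspected effect concrete, here is a minimal, Lucene-free sketch of a capped term expansion. It is not Lucene's implementation: the class FuzzyExpansionSketch, its methods, the tiny term dictionary, the cap of 3, and the ranking rule (closest edit distance first, ties broken alphabetically) are all assumptions chosen for illustration. It also assumes each document contains at most one of the expanded terms, so document counts can simply be summed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FuzzyExpansionSketch {

    /** Plain Levenshtein edit distance (no transpositions). */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns at most maxExpansions index terms within maxEdits of queryTerm.
     * Ranking is an assumption of this sketch: closest first, then alphabetical.
     */
    public static List<String> expand(String queryTerm, Map<String, Integer> termDocCounts,
                                      int maxEdits, int maxExpansions) {
        List<String> candidates = new ArrayList<>();
        for (String term : termDocCounts.keySet()) {
            if (editDistance(queryTerm, term) <= maxEdits) {
                candidates.add(term);
            }
        }
        candidates.sort((x, y) -> {
            int byDistance = Integer.compare(editDistance(queryTerm, x), editDistance(queryTerm, y));
            return byDistance != 0 ? byDistance : x.compareTo(y);
        });
        // The cap: any candidate past this point is silently dropped.
        return new ArrayList<>(candidates.subList(0, Math.min(maxExpansions, candidates.size())));
    }

    /** Sums document counts, assuming no document contains two of the terms. */
    public static int matchedDocs(List<String> terms, Map<String, Integer> termDocCounts) {
        int total = 0;
        for (String term : terms) {
            total += termDocCounts.get(term);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = new TreeMap<>();
        index.put("word", 1000);
        index.put("words", 500);
        index.put("worded", 20);
        index.put("horde", 5);
        System.out.println("before: " + matchedDocs(expand("worde", index, 1, 3), index));

        // Newly ingested documents introduce rare terms that are also one edit away.
        index.put("aorde", 1);
        index.put("borde", 1);
        System.out.println("after:  " + matchedDocs(expand("worde", index, 1, 3), index));
    }
}
```

With a fixed cap, rare new terms can push frequent terms out of the expanded set, so the matched document count can fall as the index grows. Whether this is what happens in SOLR-4824 depends on how Lucene actually ranks candidate terms, which this sketch does not reproduce.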

                

                  
> Fuzzy / Faceting results are changed after ingestion of documents past a certain number 
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-4824
>                 URL: https://issues.apache.org/jira/browse/SOLR-4824
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.2, 4.3
>         Environment: Ubuntu 12.04 LTS 12.04.2 
> jre1.7.0_17
> jboss-as-7.1.1.Final
>            Reporter: Lakshmi Venkataswamy
>
> In upgrading from SOLR 3.6 to 4.2/4.3 and comparing results on fuzzy queries, I found that after a certain number of documents were ingested the fuzzy query returned drastically fewer results.  We have approximately 18,000 documents per day, and after ingesting approximately 40 days of documents, ingesting one more day of documents causes the fuzzy search to return far fewer results.
> The query: http://10.100.1.xx:8080/solr/corex/select?q=cc:worde~1&facet=on&facet.field=date&fl=date&facet.sort
> produces the following result before the threshold is crossed:
> <response><lst name="responseHeader">
> <int name="status">0</int><int name="QTime">2349</int><lst name="params"><str name="facet">on</str><str name="fl">date</str><str name="facet.sort"/>
> <str name="q">cc:worde~1</str><str name="facet.field">date</str></lst></lst><result name="response" numFound="362803" start="0"></result>
> <lst name="facet_counts"><lst name="facet_queries"/><lst name="facet_fields"><lst name="date">
> <int name="2012-12-31">2866</int>
> <int name="2013-01-01">11372</int>
> <int name="2013-01-02">11514</int>
> <int name="2013-01-03">12015</int>
> <int name="2013-01-04">11746</int>
> <int name="2013-01-05">10853</int>
> <int name="2013-01-06">11053</int>
> <int name="2013-01-07">11815</int>
> <int name="2013-01-08">11427</int>
> <int name="2013-01-09">11475</int>
> <int name="2013-01-10">11461</int>
> <int name="2013-01-11">12058</int>
> <int name="2013-01-12">11335</int>
> <int name="2013-01-13">12039</int>
> <int name="2013-01-14">12064</int>
> <int name="2013-01-15">12234</int>
> <int name="2013-01-16">12545</int>
> <int name="2013-01-17">11766</int>
> <int name="2013-01-18">12197</int>
> <int name="2013-01-19">11414</int>
> <int name="2013-01-20">11633</int>
> <int name="2013-01-21">12863</int>
> <int name="2013-01-22">12378</int>
> <int name="2013-01-23">11947</int>
> <int name="2013-01-24">11822</int>
> <int name="2013-01-25">11882</int>
> <int name="2013-01-26">10474</int>
> <int name="2013-01-27">11051</int>
> <int name="2013-01-28">11776</int>
> <int name="2013-01-29">11957</int>
> <int name="2013-01-30">11260</int>
> <int name="2013-01-31">8511</int>
> </lst></lst><lst name="facet_dates"/><lst name="facet_ranges"/></lst></response>
> Once the threshold of 40 days of ingested documents is crossed, the results drop as shown below for the same query:
> <response><lst name="responseHeader">
> <int name="status">0</int><int name="QTime">2</int><lst name="params"><str name="facet">on</str><str name="fl">date</str><str name="facet.sort"/><str name="q">cc:worde~1</str><str name="facet.field">date</str></lst></lst>
> <result name="response" numFound="1338" start="0"></result>
> <lst name="facet_counts"><lst name="facet_queries"/><lst name="facet_fields"><lst name="date">
> <int name="2012-12-31">0</int>
> <int name="2013-01-01">41</int>
> <int name="2013-01-02">21</int>
> <int name="2013-01-03">24</int>
> <int name="2013-01-04">19</int>
> <int name="2013-01-05">9</int>
> <int name="2013-01-06">11</int>
> <int name="2013-01-07">17</int>
> <int name="2013-01-08">14</int>
> <int name="2013-01-09">24</int>
> <int name="2013-01-10">43</int>
> <int name="2013-01-11">14</int>
> <int name="2013-01-12">52</int>
> <int name="2013-01-13">57</int>
> <int name="2013-01-14">25</int>
> <int name="2013-01-15">17</int>
> <int name="2013-01-16">34</int>
> <int name="2013-01-17">11</int>
> <int name="2013-01-18">16</int>
> <int name="2013-01-19">121</int>
> <int name="2013-01-20">33</int>
> <int name="2013-01-21">26</int>
> <int name="2013-01-22">59</int>
> <int name="2013-01-23">27</int>
> <int name="2013-01-24">10</int>
> <int name="2013-01-25">9</int>
> <int name="2013-01-26">6</int>
> <int name="2013-01-27">16</int>
> <int name="2013-01-28">11</int>
> <int name="2013-01-29">15</int>
> <int name="2013-01-30">21</int>
> <int name="2013-01-31">109</int>
> <int name="2013-02-01">11</int>
> <int name="2013-02-02">7</int>
> <int name="2013-02-03">10</int>
> <int name="2013-02-04">8</int>
> <int name="2013-02-05">13</int>
> <int name="2013-02-06">75</int>
> <int name="2013-02-07">77</int>
> <int name="2013-02-08">31</int>
> <int name="2013-02-09">35</int>
> <int name="2013-02-10">22</int>
> <int name="2013-02-11">18</int>
> <int name="2013-02-12">11</int>
> <int name="2013-02-13">68</int>
> <int name="2013-02-14">40</int>
> </lst></lst><lst name="facet_dates"/><lst name="facet_ranges"/></lst></response>
> I have also tested this with different months of data and have seen the same issue at around the same number of documents.
