Posted to dev@lucene.apache.org by "Cao Manh Dat (JIRA)" <ji...@apache.org> on 2016/11/04 11:23:59 UTC

[jira] [Comment Edited] (SOLR-8893) Wrong TermVector docfreq calculation with enabled ExactStatsCache

    [ https://issues.apache.org/jira/browse/SOLR-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636065#comment-15636065 ] 

Cao Manh Dat edited comment on SOLR-8893 at 11/4/16 11:23 AM:
--------------------------------------------------------------

I'm digging into this problem, but it turns out that the purpose of ExactStatsCache is to compute correct tf-idf only for terms that appear in the query, not for arbitrary terms as in the case of TermVectorComponent.

Computing global statistics for arbitrary terms would be very costly, so should we support this kind of operation in ExactStatsCache?
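
As a rough illustration of the cost difference, here is a minimal sketch. It is not Solr's actual code: the class name, the method name, and the per-shard numbers are all hypothetical. It merges per-shard docfreq values into a global docfreq for a given set of terms. For a query, that set is just the query terms; for TermVectorComponent it would be every term in the term vectors of the returned documents.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of Solr: sums per-shard document
// frequencies into global ones for a chosen set of terms.
public class GlobalDocFreqSketch {

    static Map<String, Long> mergeDocFreq(List<Map<String, Long>> perShardDocFreq,
                                          Set<String> terms) {
        Map<String, Long> global = new HashMap<>();
        for (Map<String, Long> shard : perShardDocFreq) {
            for (String term : terms) {
                Long df = shard.get(term);
                if (df != null) {
                    // Global docfreq is the sum of the shard-local docfreqs.
                    global.merge(term, df, Long::sum);
                }
            }
        }
        return global;
    }

    public static void main(String[] args) {
        // Made-up shard-local statistics, for illustration only.
        Map<String, Long> shard1 = Map.of("abakus", 4L, "solr", 10L);
        Map<String, Long> shard2 = Map.of("abakus", 2L, "lucene", 7L);
        List<Map<String, Long>> shards = List.of(shard1, shard2);

        // Query case: only the query term needs global statistics.
        System.out.println(mergeDocFreq(shards, Set.of("abakus")));

        // Term vector case: every term of the returned documents would
        // need them, so the exchanged data grows with the vocabulary.
        System.out.println(mergeDocFreq(shards, Set.of("abakus", "solr", "lucene")));
    }
}
```

In the query case the merged map has one entry per query term. In the term vector case it has one entry per distinct term in the returned documents, which is where the extra cost would come from.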


was (Author: caomanhdat):
I'm digging into this problem, but it turns out that the purpose of ExactStatsCache is to compute correct tf-idf only for terms that appear in the query, not for arbitrary terms as in the case of TermVectorComponent.

> Wrong TermVector docfreq calculation with enabled ExactStatsCache
> -----------------------------------------------------------------
>
>                 Key: SOLR-8893
>                 URL: https://issues.apache.org/jira/browse/SOLR-8893
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.5
>            Reporter: Andreas Daffner
>
> Hi,
> we are currently facing the issue that some values calculated by the TV component are obviously wrong when
> ExactStatsCache is enabled, specifically the shard-wide TV docfreq calculation.
> This problem is a follow-up to
> SOLR-8459 NPE using TermVectorComponent in combinition with ExactStatsCache
> Maybe the problem is trivial and we have configured something wrong ...
> So let's go deeper into the problem:
> 1) The problem in summary:
> ==================
> We send requests with "tv.df", "tv.tf" and "tv.tf_idf" enabled:
> {code}
> tv.df=true&tv.tf_idf=true&tv.tf=true
> {code}
> Additionally, for debugging purposes, we request the following function values:
> {code}
> termfreq("site_term_maincontent","abakus"),docfreq("site_maincontent_term_wdf","abakus"),ttf("site_maincontent_term_wdf","abakus")
> {code}
> Our findings are:
> - tv.tf and termfreq seem to be correct
> - tv.df and docfreq are obviously wrong
> - tv.tf_idf and ttf are wrong as well, I guess as a consequence of the wrong tv.df (docfreq)
> 2) What we have:
> ===========
> schema.xml:
> {code}
> ...
>         <field name="site_maincontent_term_wdf" type="text_token_wdf" indexed="true" stored="true" termVectors="true"
>                termPositions="true" termOffsets="true"/>
> ...
>         <fieldType name="text_token_wdf" class="solr.TextField" positionIncrementGap="100">
>             <analyzer>
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
>             </analyzer>
>         </fieldType>
> ...
> {code}
> solrconfig.xml:
> {code}
> ...
>     <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
> ...
>     <searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>
>     <requestHandler name="/tvrh" class="org.apache.solr.handler.component.SearchHandler">
>         <lst name="defaults">
>             <bool name="tv">true</bool>
>         </lst>
>         <arr name="last-components">
>             <str>tvComponent</str>
>         </arr>
>     </requestHandler>
> ...
> {code}
> You can find all the details here:
> http://149.202.5.192:8820/solr/#/SingleDomainSite_34_shard1_replica1
> 3) Examples
> ========
> If you call this link, you can see that 6 documents contain the word "abakus" in the field "site_maincontent_term_wdf":
> http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?q=site_maincontent_term_wdf%3Aabakus+AND+site_headercode%3A200&shards.qt=%2Ftvrh&tv.fl=site_maincontent_term_wdf&tv.df=true&tv.tf_idf=true&tv.tf=true&fl=site_url_id,site_url,termfreq%28%22site_term_maincontent%22,%22abakus%22%29,docfreq%28%22site_maincontent_term_wdf%22,%22abakus%22%29,ttf%28%22site_maincontent_term_wdf%22,%22abakus%22%29
> But the "docfreq" field in the returned documents is incorrect and differs from document to document (it should always be the same):
> "docfreq(field,term) returns the number of documents that contain the term in the field. This is a constant (the same value for all documents in the index)."
> Here is a link with shards.info enabled:
> http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?&wt=xml&q=site_maincontent_term_wdf%3Aabakus&start=0&rows=10&fl=ttf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cdocfreq%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cidf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Csite_url&shards.qt=/tvrh&shards.info=true
> Here is a link with debug enabled:
> http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?omitHeader=true&shards.qt=%2Ftvrh&wt=xml&json.nl=flat&q=site_maincontent_term_wdf%3Aabakus&start=0&rows=1000&fl=ttf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cdocfreq%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cidf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Csite_url&debugQuery=true


