Posted to dev@lucene.apache.org by "Cao Manh Dat (JIRA)" <ji...@apache.org> on 2016/11/04 11:23:59 UTC

[jira] [Commented] (SOLR-8893) Wrong TermVector docfreq calculation with enabled ExactStatsCache

    [ https://issues.apache.org/jira/browse/SOLR-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636065#comment-15636065 ] 

Cao Manh Dat commented on SOLR-8893:
------------------------------------

I've been digging into this problem, and it turns out that the purpose of ExactStatsCache is to compute correct tf-idf only for terms that appear in the query, not for arbitrary terms as in the case of TermVectorComponent.
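
To illustrate the distinction, here is a toy model in Python. It is not Solr's code: the function names, the per-shard counts, and the fallback to shard-local statistics for terms outside the query are all assumptions made for this sketch.

```python
from collections import Counter


def merge_doc_freqs(shard_stats, query_terms):
    """Sum per-shard document frequencies, but only for the terms of the query."""
    merged = Counter()
    for stats in shard_stats:
        for term in query_terms:
            merged[term] += stats.get(term, 0)
    return dict(merged)


def doc_freq(term, global_stats, local_stats):
    """Query terms get the collection-wide value; any other term falls back
    to the shard-local value (an assumption of this sketch)."""
    if term in global_stats:
        return global_stats[term]
    return local_stats.get(term, 0)


# Made-up per-shard document frequencies.
shard1 = {"abakus": 4, "rechner": 7}
shard2 = {"abakus": 2, "rechner": 1}

# Only "abakus" is part of the query, so only its statistics are merged.
global_stats = merge_doc_freqs([shard1, shard2], ["abakus"])
```

In this model a term that appears only in a term vector ("rechner") never receives a merged value, so each shard would report its own count for it.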

> Wrong TermVector docfreq calculation with enabled ExactStatsCache
> -----------------------------------------------------------------
>
>                 Key: SOLR-8893
>                 URL: https://issues.apache.org/jira/browse/SOLR-8893
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.5
>            Reporter: Andreas Daffner
>
> Hi,
> we are currently facing an issue where some values calculated by the TV component are obviously wrong when
> ExactStatsCache is enabled. --> shard-wide TV docfreq calculation
> This problem is a follow-up to
> SOLR-8459 NPE using TermVectorComponent in combinition with ExactStatsCache
> Maybe the problem is trivial and we configured something wrong ...
> So let's go deeper into the problem:
> 1) The problem in summary:
> ==================
> We are sending requests with "tv.df", "tv.tf" and "tv.tf_idf" enabled -->
> {code}
> tv.df=true&tv.tf_idf=true&tv.tf=true
> {code}
> Additionally, for debugging purposes, we are requesting the following function values:
> {code}
> termfreq("site_term_maincontent","abakus"),docfreq("site_maincontent_term_wdf","abakus"),ttf("site_maincontent_term_wdf","abakus")
> {code}
> Our findings are:
> - tv.tf as well as termfreq seem to be correct
> - tv.df as well as docfreq are obviously wrong
> - tv.tf_idf as well as ttf are wrong as well, I guess as a consequence of the wrong tv.df (docfreq)
> 2) What we have:
> ===========
> schema.xml:
> {code}
> ...
>         <field name="site_maincontent_term_wdf" type="text_token_wdf" indexed="true" stored="true" termVectors="true"
>                termPositions="true" termOffsets="true"/>
> ...
>         <fieldType name="text_token_wdf" class="solr.TextField" positionIncrementGap="100">
>             <analyzer>
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
>             </analyzer>
>         </fieldType>
> ...
> {code}
> solrconfig.xml:
> {code}
> ...
>     <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
> ...
>     <searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>
>     <requestHandler name="/tvrh" class="org.apache.solr.handler.component.SearchHandler">
>         <lst name="defaults">
>             <bool name="tv">true</bool>
>         </lst>
>         <arr name="last-components">
>             <str>tvComponent</str>
>         </arr>
>     </requestHandler>
> ...
> {code}
> You can find all the details here:
> http://149.202.5.192:8820/solr/#/SingleDomainSite_34_shard1_replica1
> 3) Examples
> ========
> If you call this link, you can see that there are 6 documents containing the word "abakus" in the field "site_maincontent_term_wdf" ...
> http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?q=site_maincontent_term_wdf%3Aabakus+AND+site_headercode%3A200&shards.qt=%2Ftvrh&tv.fl=site_maincontent_term_wdf&tv.df=true&tv.tf_idf=true&tv.tf=true&fl=site_url_id,site_url,termfreq%28%22site_term_maincontent%22,%22abakus%22%29,docfreq%28%22site_maincontent_term_wdf%22,%22abakus%22%29,ttf%28%22site_maincontent_term_wdf%22,%22abakus%22%29
> But if you look at the "docfreq" field in the output documents, it is incorrect and differs from document to document (it should always be the same ...).
> "docfreq(field,term) returns the number of documents that contain the term in the field. This is a constant (the same value for all documents in the index)."
> Here is a link with shards.info enabled:
> http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?&wt=xml&q=site_maincontent_term_wdf%3Aabakus&start=0&rows=10&fl=ttf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cdocfreq%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cidf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Csite_url&shards.qt=/tvrh&shards.info=true
> Here is a link with debug enabled:
> http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?omitHeader=true&shards.qt=%2Ftvrh&wt=xml&json.nl=flat&q=site_maincontent_term_wdf%3Aabakus&start=0&rows=1000&fl=ttf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cdocfreq%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cidf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Csite_url&debugQuery=true
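
The check described in the quoted report (docfreq should be the same for every returned document) can be sketched in Python as follows. The field key and the values are made up for illustration and are not taken from the actual responses; a real response would first have to be parsed into a list of documents.

```python
def distinct_values(docs, key):
    """Return the set of values that `key` takes across the returned documents."""
    return {doc[key] for doc in docs if key in doc}


# Hypothetical field key and made-up values, shaped like documents in a response.
DOCFREQ_KEY = 'docfreq("site_maincontent_term_wdf","abakus")'
docs = [
    {"site_url_id": 1, DOCFREQ_KEY: 6},
    {"site_url_id": 2, DOCFREQ_KEY: 3},
    {"site_url_id": 3, DOCFREQ_KEY: 6},
]

values = distinct_values(docs, DOCFREQ_KEY)
# docfreq is expected to be a constant, so more than one distinct value is suspicious.
is_constant = len(values) <= 1
```

More than one distinct value in `values` would correspond to the symptom reported above.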


