Posted to dev@lucene.apache.org by "Toke Eskildsen (JIRA)" <ji...@apache.org> on 2018/12/10 19:54:00 UTC

[jira] [Comment Edited] (SOLR-8362) Add docValues support for TextField

    [ https://issues.apache.org/jira/browse/SOLR-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715478#comment-16715478 ] 

Toke Eskildsen edited comment on SOLR-8362 at 12/10/18 7:53 PM:
----------------------------------------------------------------

[~hossman] using this field type for distributed faceting can lead to wrong results. Maybe this should be noted in the JavaDoc or the Solr documentation?

This can be demonstrated by installing the cloud version of the {{gettingstarted}} sample with

{{./solr -e cloud}}

accepting the defaults throughout, except for {{shards}}, which should be {{3}}. After that, a corpus can be indexed with

{{( echo '[' ; for J in $(seq 0 99); do ID=$((J)) ; echo "{\"id\":\"$ID\",\"facet_t_sort\":\"a b $J\"},"; done ; echo '{"id":"duplicate_1","facet_t_sort":"a b"},{"id":"duplicate_2","facet_t_sort":"a b"}]' ) | curl -s -d @- -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/gettingstarted/update?commit=true'}}

This will index 100 documents with a single-valued field {{facet_t_sort:"a b X"}}, where X is the document number, plus 2 documents with {{facet_t_sort:"a b"}}. The call

curl 'http://localhost:8983/solr/gettingstarted/select?facet.field=facet_t_sort&facet.limit=5&facet=on&q=*:*&rows=0'

should return "a b" as the top facet term with count 2, but returns

    {
      "responseHeader":{
        "zkConnected":true,
        "status":0,
        "QTime":13,
        "params":{
          "facet.limit":"5",
          "q":"*:*",
          "facet.field":"facet_t_sort",
          "rows":"0",
          "facet":"on"}},
      "response":{"numFound":102,"start":0,"maxScore":1.0,"docs":[]
      },
      "facet_counts":{
        "facet_queries":{},
        "facet_fields":{
          "facet_t_sort":[
            "a b",36,
            "a b 0",1,
            "a b 1",1,
            "a b 10",1,
            "a b 11",1]},
        "facet_ranges":{},
        "facet_intervals":{},
        "facet_heatmaps":{}}}

The problem is the second phase of simple faceting, where the fine-counting happens. In the first phase, "a b" is returned from 1 or 2 of the 3 shards. It wins the popularity contest, as there are 2 "a b"-terms and only 1 of each of the other terms. The 1 or 2 shards that did not deliver "a b" in the first phase are then queried for the count of "a b", which happens in the form of a {{facet_t_sort:"a b"}}-lookup. It seems that this lookup uses the analyzer chain and thus matches _all_ the documents in that shard (approximately 102/3 = 34), since every document in the corpus contains the tokens "a" and "b". That would be consistent with the reported count of 36.

An alternative would be to do the fine-counting on the DocValues instead, but that works very poorly with many values, so that seems more like a trap than a solution.
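
To make the suspected mechanism concrete, here is a minimal Python sketch of two-phase distributed faceting. It is a toy model, not Solr code: the function names, the whitespace tokenizer standing in for the analyzer chain, the phrase-style matching and the round-robin shard routing are all assumptions made for illustration (Solr routes documents by hashing the id, and its first phase over-requests terms). The refinement step is pluggable, so the analyzed lookup can be compared with an exact-value count of the kind a DocValues-based fine-count would give.

```python
from collections import Counter


def tokenize(text):
    # Assumption: whitespace splitting stands in for the real analyzer chain.
    return text.split()


def build_corpus():
    # Same corpus as the shell command: 100 "a b N" values plus two "a b".
    docs = [(str(i), f"a b {i}") for i in range(100)]
    docs += [("duplicate_1", "a b"), ("duplicate_2", "a b")]
    return docs


def split_into_shards(docs, num_shards=3):
    # Hypothetical round-robin routing, chosen only to keep the toy deterministic.
    shards = [[] for _ in range(num_shards)]
    for n, (_doc_id, value) in enumerate(docs):
        shards[n % num_shards].append(value)
    return shards


def top_terms(counts, limit):
    # Highest count first, ties broken by term order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


def contains_phrase(doc_tokens, query_tokens):
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


def analyzed_count(shard, term):
    # Refinement as an analyzed phrase lookup: "a b" matches "a b 17" too.
    query = tokenize(term)
    return sum(1 for value in shard if contains_phrase(tokenize(value), query))


def exact_count(shard, term):
    # Refinement on the untokenized value, as a DocValues-style count would do.
    return sum(1 for value in shard if value == term)


def distributed_facet(shards, limit, refine_count):
    # Phase 1: each shard reports its own top terms, counted on full values.
    first_phase = [dict(top_terms(Counter(shard), limit)) for shard in shards]
    merged = Counter()
    for partial in first_phase:
        merged.update(partial)
    candidates = [term for term, _ in top_terms(merged, limit)]

    # Phase 2: ask every shard that did not report a candidate for its count.
    refined = {}
    for term in candidates:
        total = merged[term]
        for shard, partial in zip(shards, first_phase):
            if term not in partial:
                total += refine_count(shard, term)
        refined[term] = total
    return top_terms(refined, limit)


if __name__ == "__main__":
    shards = split_into_shards(build_corpus())
    print("analyzed refinement:", distributed_facet(shards, 5, analyzed_count))
    print("exact refinement:   ", distributed_facet(shards, 5, exact_count))
```

With this hypothetical layout, one shard holds 34 documents and neither duplicate, so the analyzed refinement adds 34 to the first-phase count of 2 for "a b", while the exact-value refinement leaves it at 2. That the toy total matches the count reported above depends on the assumed routing and should not be read as confirmation of how Solr distributed the documents.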



> Add docValues support for TextField
> -----------------------------------
>
>                 Key: SOLR-8362
>                 URL: https://issues.apache.org/jira/browse/SOLR-8362
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Hoss Man
>            Priority: Major
>
> At the last Lucene/Solr Revolution, Toke asked a question about why TextField doesn't support docValues.  The short answer is because no one ever added it, but the longer answer is that we would have to think through carefully the _intent_ of supporting docValues for a "tokenized" field like TextField, and how to support various conflicting use cases where they could be handy.


