Posted to solr-user@lucene.apache.org by "Whelan, Andy" <aw...@srcinc.com> on 2016/07/07 18:28:20 UTC
Facet in SOLR Cloud vs Core
Hello,
I am somewhat of a novice when it comes to using SOLR in a distributed SolrCloud environment. My team and I are doing development work with a SOLR core. We will shortly be transitioning over to a SolrCloud environment.
My question specifically has to do with Facets in a SOLR cloud/collection (distributed environment). The core I am working with has a field "dataSourceName" defined as follows in its schema.xml file.
<field name="dataSourceName" type="string" indexed="true" stored="true" required="true"/>
I am using the following facet query, which works fine in my core-based index:
http://localhost:8983/solr/gamra/select?q=*:*&rows=0&facet=true&facet.field=dataSourceName
It returns counts for each distinct dataSourceName as follows (which is the desired behavior).
<lst name="facet_fields">
<lst name="dataSourceName">
<int name="DATA_SOURCE1">169</int>
<int name=" DATA_SOURCE2">121</int>
<int name=" DATA_SOURCE3">68</int>
</lst>
</lst>
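For reference, a minimal sketch of reading those counts out of the XML response in Python. The wrapping "facet_counts" element is only there to make the fragment well-formed, and the function name parse_facet_field is made up for this illustration; it is not part of any Solr client library.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the response above, wrapped so that it parses on its own.
SAMPLE = """
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="dataSourceName">
      <int name="DATA_SOURCE1">169</int>
      <int name="DATA_SOURCE2">121</int>
      <int name="DATA_SOURCE3">68</int>
    </lst>
  </lst>
</lst>
"""

def parse_facet_field(xml_text, field):
    """Return {term: count} for one facet field (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    counts = {}
    for lst in root.iter("lst"):
        if lst.get("name") == field:
            for item in lst.findall("int"):
                # strip() tolerates stray whitespace in the term name
                counts[item.get("name").strip()] = int(item.text)
    return counts

counts = parse_facet_field(SAMPLE, "dataSourceName")
print(counts)
```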
I am wondering if this should work fine in the SOLR Cloud as well? Will this method give me accurate counts out of the box in a SOLR Cloud configuration?
Thanks
-Andrew
PS: The reason I ask is because I know there is some estimating performed in certain cases for the Facet "unique" function (as is outlined here: http://yonik.com/solr-count-distinct/ ). So I guess I am wondering why folks wouldn't just do what I have done vs going through the trouble of using the unique(dataSourceName) function?
Re: Facet in SOLR Cloud vs Core
Posted by Chris Hostetter <ho...@fucit.org>.
: My question specifically has to do with Facets in a SOLR
: cloud/collection (distributed environment). The core I am working with
...
: I am using the following facet query, which works fine in my core-based index:
:
: http://localhost:8983/solr/gamra/select?q=*:*&rows=0&facet=true&facet.field=dataSourceName
:
: It returns counts for each distinct dataSourceName as follows (which is the desired behavior).
...
: I am wondering if this should work fine in the SOLR Cloud as well?
: Will this method give me accurate counts out of the box in a SOLR Cloud
: configuration?
Yes it will.
Solr uses a two-pass approach for faceting. In pass #1 the "top"
constraints are determined from each shard (overrequesting based on your
original facet.limit) and then aggregated together. Pass #2 is a
"refinement" step: any terms from the aggregated "top" constraints are
checked to see which shards (if any) did not include them in the per-shard
"top" constraints, and those shards are asked to compute a constraint
count for those terms as needed. These counts are then added into the
aggregated counts for each term, and the terms are re-sorted.
This means that in some pathological term distributions, a term may be
excluded from the list of "top" terms if it isn't returned by *any* shard
in pass #1, but for any term that is returned to the end client, the count
is 100% accurate.
(NOTE: this info applies to the default solr faceting and solr's pivot
faceting, but the relatively new "json faceting" does not support this
multi-pass refinement of the facet counts.)
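The two passes described above can be sketched as a small in-memory simulation. This is only an illustration under simplifying assumptions, not Solr's actual code: shards are plain dicts of term to count, every candidate term is refined (Solr only refines the ones it needs to), and the names top_terms and distributed_facet are invented for the example.

```python
from collections import Counter

def top_terms(shard, n):
    """Return a shard's n highest-count terms as {term: count}."""
    ranked = sorted(shard.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])

def distributed_facet(shards, limit, shard_limit):
    # Pass 1: each shard reports only its own top terms.
    first_pass = [top_terms(s, shard_limit) for s in shards]
    totals = Counter()
    for partial in first_pass:
        totals.update(partial)
    # Pass 2 (refinement): for every candidate term, ask the shards
    # that did not report it in pass 1 for their exact count.
    for term in list(totals):
        for shard, partial in zip(shards, first_pass):
            if term not in partial:
                totals[term] += shard.get(term, 0)
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]

# Refinement at work: "b" is missing from the first shard's top 1,
# but its count is filled in during pass 2.
refined = distributed_facet(
    [{"a": 5, "b": 4}, {"b": 4, "c": 3, "a": 1}], limit=1, shard_limit=1)

# Pathological distribution: "z" (16 docs overall) never makes any
# shard's top 2, so it is never considered. The term that is returned
# still carries an exact count.
missed = distributed_facet(
    [{"x": 10, "y": 9, "z": 8}, {"p": 11, "q": 9, "z": 8}],
    limit=1, shard_limit=2)

print(refined, missed)
```

In the second call the per-shard limit is deliberately tiny so that the miss shows up; overrequesting makes this much less likely in practice.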
: PS: The reason I ask is because I know there is some estimating
: performed in certain cases for the Facet "unique" function (as is
: outlined here: http://yonik.com/solr-count-distinct/ ). So I guess I am
: wondering why folks wouldn't just do what I have done vs going through
: the trouble of using the unique(dataSourceName) function?
What you linked to addresses a different problem than simple facet
counts. In your case you are getting the "top" terms with their
document counts, but what that blog post is referring to is counting the
total number of unique *terms* (ie: in your data set: what is the total
number of all unique values in the "dataSourceName" field?).
Distributed counting of unique values in high cardinality sets is a
"hard" problem, as the only way to be 100% accurate is to aggregate all
terms from all shards into a single node to be hashed (or sorted). For
"batch" style analytics this is a trivial map-reduce style job that can
offload to disk, but in "real time" situations, statistical sampling
approaches like HyperLogLog (used in solr) make more sense to get
approximations w/o exploding ram usage.
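A tiny illustration of why per-shard unique counts cannot simply be added up, and why the exact answer needs the terms themselves in one place. The shard contents are made up for the example, and the function names are hypothetical.

```python
def naive_unique(shard_terms):
    """Sum each shard's local distinct count (wrong when terms overlap)."""
    return sum(len(terms) for terms in shard_terms)

def exact_unique(shard_terms):
    """Ship every term to one place and take the union (exact, but memory heavy)."""
    merged = set()
    for terms in shard_terms:
        merged |= terms
    return len(merged)

# Hypothetical shard contents: two of the three terms live on both shards.
shards = [
    {"DATA_SOURCE1", "DATA_SOURCE2", "DATA_SOURCE3"},
    {"DATA_SOURCE2", "DATA_SOURCE3"},
]

naive = naive_unique(shards)
exact = exact_unique(shards)
print(naive, exact)
```

With three distinct values the union is trivial; the cost only becomes a concern when each shard holds a very large number of distinct terms, which is where an approximation such as HyperLogLog comes in.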
-Hoss
http://www.lucidworks.com/
Re: Facet in SOLR Cloud vs Core
Posted by Pablo Anzorena <an...@gmail.com>.
Sorry for introducing bad information.
Because it happens in the json facet api, I thought it would also happen in
the regular facet. Sorry again for the misunderstanding.
2016-07-07 16:08 GMT-03:00 Chris Hostetter <ho...@fucit.org>:
>
> : The problem with the shards appears in the following scenario (note that
> : the problem below also applies in a solr standalone environment with
> : distributed search):
> :
> : Shard1: DATA_SOURCE1 (3 docs), DATA_SOURCE2 (2 docs), DATA_SOURCE3 (2 docs).
> : Shard2: DATA_SOURCE3 (2 docs), DATA_SOURCE2 (1 docs).
> :
> : If you make a distributed search across these two shards, faceting
> : dataSourceName with a limit of 1, it will ask for the top 1 in the first
> : shard (DATA_SOURCE1 (3 docs)) and for the top 1 in the second shard
> : (DATA_SOURCE3 (2 docs)). After that it will merge the results and return
> : DATA_SOURCE1 (3 docs), when it should have returned DATA_SOURCE3 (4 docs).
>
> That's completely false.
>
> a) in the first pass, even if you ask for "top 1" (ie: facet.limit=1) solr
> will overrequest when communicating with each shard (the amount of
> overrequest is a function of your facet.limit, so as facet.limit increases
> so does the overrequest amount)
>
> b) if *any* (but not *all*) shards return DATA_SOURCE3 from the
> initial shard request, a second "refinement" step will request the count
> for DATA_SOURCE3 from all of the other shards to get an accurate count,
> and to accurately sort DATA_SOURCE3 to the top of the facet constraint
> list.
>
>
> -Hoss
> http://www.lucidworks.com/
>
Re: Facet in SOLR Cloud vs Core
Posted by Chris Hostetter <ho...@fucit.org>.
: The problem with the shards appears in the following scenario (note that
: the problem below also applies in a solr standalone environment with
: distributed search):
:
: Shard1: DATA_SOURCE1 (3 docs), DATA_SOURCE2 (2 docs), DATA_SOURCE3 (2 docs).
: Shard2: DATA_SOURCE3 (2 docs), DATA_SOURCE2 (1 docs).
:
: If you make a distributed search across these two shards, faceting
: dataSourceName with a limit of 1, it will ask for the top 1 in the first
: shard (DATA_SOURCE1 (3 docs)) and for the top 1 in the second shard
: (DATA_SOURCE3 (2 docs)). After that it will merge the results and return
: DATA_SOURCE1 (3 docs), when it should have returned DATA_SOURCE3 (4 docs).
That's completely false.
a) in the first pass, even if you ask for "top 1" (ie: facet.limit=1) solr
will overrequest when communicating with each shard (the amount of
overrequest is a function of your facet.limit, so as facet.limit increases
so does the overrequest amount)
b) if *any* (but not *all*) shards return DATA_SOURCE3 from the
initial shard request, a second "refinement" step will request the count
for DATA_SOURCE3 from all of the other shards to get an accurate count,
and to accurately sort DATA_SOURCE3 to the top of the facet constraint
list.
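To make point (a) concrete for the two-shard scenario quoted above, here is a rough sketch. It assumes the per-shard limit is computed as facet.limit * facet.overrequest.ratio + facet.overrequest.count with defaults of 1.5 and 10; treat both the formula and the defaults as assumptions to check against the reference guide for your Solr version. The function names are invented for the example.

```python
from collections import Counter

def shard_limit(facet_limit, ratio=1.5, count=10):
    # Assumed overrequest formula and defaults (see the note above).
    return int(facet_limit * ratio + count)

def merged_top(shards, facet_limit):
    """Ask each shard for its overrequested top terms, then merge."""
    n = shard_limit(facet_limit)
    totals = Counter()
    for shard in shards:
        top = sorted(shard.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        totals.update(dict(top))
    return totals.most_common(facet_limit)

# The scenario from the quoted message.
shard1 = {"DATA_SOURCE1": 3, "DATA_SOURCE2": 2, "DATA_SOURCE3": 2}
shard2 = {"DATA_SOURCE3": 2, "DATA_SOURCE2": 1}

per_shard = shard_limit(1)
result = merged_top([shard1, shard2], facet_limit=1)
print(per_shard, result)
```

Under these assumptions each shard is asked for far more than one term, so both shards report DATA_SOURCE3 in the first pass and no refinement is even needed here.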
-Hoss
http://www.lucidworks.com/
Re: Facet in SOLR Cloud vs Core
Posted by Pablo Anzorena <an...@gmail.com>.
As long as you don't shard your index, you will have no problem migrating
to solrcloud.
The problem with the shards appears in the following scenario (note that
the problem below also applies in a solr standalone environment with
distributed search):
Shard1: DATA_SOURCE1 (3 docs), DATA_SOURCE2 (2 docs), DATA_SOURCE3 (2 docs).
Shard2: DATA_SOURCE3 (2 docs), DATA_SOURCE2 (1 docs).
If you make a distributed search across these two shards, faceting
dataSourceName with a limit of 1, it will ask for the top 1 in the first
shard (DATA_SOURCE1 (3 docs)) and for the top 1 in the second shard
(DATA_SOURCE3 (2 docs)). After that it will merge the results and return
DATA_SOURCE1 (3 docs), when it should have returned DATA_SOURCE3 (4 docs).
Summarizing: if you make a distributed search with a facet.limit, there is
a chance that the count is not correct (it also applies to stats).