Posted to solr-user@lucene.apache.org by Yago Riveiro <ya...@gmail.com> on 2015/12/22 00:56:35 UTC

Json facet api method stream

Hi,

Does the JSON Facet API method "stream" use docValues internally to do the
aggregation on the fly?

I want to know whether using this method justifies having docValues configured
in the schema.



-----
Best regards

Re: Json facet api method stream

Posted by Yago Riveiro <ya...@gmail.com>.
Here is a live example:

[yago@dev-1 ~]$ time curl -g "http://dev-1:8983/solr/collection-perf/query?rows=0&q=date:[20150101%20TO%2020150115]&json.facet={label:{type:terms,field:url_encoded,limit:-1,sort:{index:asc},facet:{user:'hll(user_id)'}}}" > dump
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 90.7M    0 90.7M    0     0  1039k      0 --:--:--  0:01:29 --:--:-- 21.2M

real	1m29.387s
user	0m0.065s
sys	0m0.338s

[yago@dev-1 ~]$ time curl -g "http://dev-1/solr/collection-perf/query?rows=0&q=date:[20150101%20TO%2020150115]&json.facet={label:{type:terms,field:url_encoded,limit:-1,sort:{index:asc},method:stream,facet:{user:'hll(user_id)'}}}" > dump-stream
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 90.7M    0 90.7M    0     0  9276k      0 --:--:--  0:00:10 --:--:-- 22.6M

real	0m10.026s
user	0m0.038s
sys	0m0.245s

[yago@dev-1 ~]$ diff dump dump-stream
[yago@dev-1 ~]$
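
As a rough reading of the two runs above: both responses are 90.7M and the diff is empty, so the outputs are identical and only the wall-clock time differs. The sketch below is plain arithmetic on the `real` values from the transcript. The helper name `to_seconds` is made up for illustration, and one run per method says nothing about variance or warm-up effects.

```python
import re

def to_seconds(real: str) -> float:
    """Convert a `time` value such as '1m29.387s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if match is None:
        raise ValueError(f"unexpected time format: {real!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

default_s = to_seconds("1m29.387s")  # run without method:stream
stream_s = to_seconds("0m10.026s")   # run with method:stream

speedup = default_s / stream_s
print(f"default: {default_s:.3f}s, stream: {stream_s:.3f}s, speedup: {speedup:.1f}x")
```

On these two measurements the streaming request finished about 8.9 times faster.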




—/Yago Riveiro

On Tue, Dec 22, 2015 at 3:57 PM, Yago Riveiro <ya...@gmail.com>
wrote:

> The collection has 12 shards distributed across 12 physical nodes (24G heap each, 32G RAM), with no replication. All caches are disabled in solrconfig.xml: the indexing rate is about 2000 docs/s, which makes the caches useless.
> At the time of the perf test the collection held 34M docs (it is now 54M, and the set will grow to roughly 600 million) with 7M (and growing) unique keys. I'm indexing docs with a url and a user_id.
> {
> name: "url_encoded",
> type: "string",
> docValues: true,
> indexed: true,
> stored: true
> },
> {
> name: "user_id",
> type: "tlong",
> docValues: true,
> multiValued: false,
> indexed: true,
> stored: true
> },
> The query is simple: aggregate by url, with a sub-facet on each url to estimate the number of unique users.
> I'm using Solr 5.3.1.
> - Normal query (which I guess uses the DVs under the hood): json.facet={url:{type:terms,field:url,limit:-1,sort:{index:asc},facet:{users:'hll(user_id)'}}}
> - Streaming query: json.facet={url:{type:terms,field:url,limit:-1,sort:{index:asc},facet:{users:'hll(user_id)'}, method:stream}}
> This is a perf test to see whether Solr can aggregate the 600M URLs with their unique users, and what the average response time would be (minutes are acceptable, but the lower the better).
> —/Yago Riveiro

Re: Json facet api method stream

Posted by Yago Riveiro <ya...@gmail.com>.
The collection has 12 shards distributed across 12 physical nodes (24G heap each, 32G RAM), with no replication. All caches are disabled in solrconfig.xml: the indexing rate is about 2000 docs/s, which makes the caches useless.

At the time of the perf test the collection held 34M docs (it is now 54M, and the set will grow to roughly 600 million) with 7M (and growing) unique keys. I'm indexing docs with a url and a user_id.

{
name: "url_encoded",
type: "string",
docValues: true,
indexed: true,
stored: true
},
{
name: "user_id",
type: "tlong",
docValues: true,
multiValued: false,
indexed: true,
stored: true
},

The query is simple: aggregate by url, with a sub-facet on each url to estimate the number of unique users.

I'm using Solr 5.3.1.

- Normal query (which I guess uses the DVs under the hood): json.facet={url:{type:terms,field:url,limit:-1,sort:{index:asc},facet:{users:'hll(user_id)'}}}
- Streaming query: json.facet={url:{type:terms,field:url,limit:-1,sort:{index:asc},facet:{users:'hll(user_id)'}, method:stream}}
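
For anyone scripting this comparison, here is a minimal sketch of how the two facet requests could be generated so that they differ only in `method`. It assumes the facet is sent as strict JSON in the `json.facet` parameter, which Solr accepts alongside the relaxed syntax used above. The function names `terms_facet` and `query_string` are hypothetical, and the code only builds the query string; it does not contact a server.

```python
import json
from urllib.parse import urlencode

def terms_facet(field, stat_field, stream=False):
    """Build a terms facet with an hll() sub-facet, mirroring the queries above."""
    facet = {
        "type": "terms",
        "field": field,
        "limit": -1,
        "sort": {"index": "asc"},
        "facet": {"users": f"hll({stat_field})"},
    }
    if stream:
        # The only difference between the two requests.
        facet["method"] = "stream"
    return {"url": facet}

def query_string(facet, q="*:*"):
    """URL-encode the request parameters for the /query handler."""
    return urlencode({"q": q, "rows": 0, "json.facet": json.dumps(facet)})

normal = terms_facet("url", "user_id")
streaming = terms_facet("url", "user_id", stream=True)
print(query_string(streaming, q="date:[20150101 TO 20150115]"))
```

Generating both variants from one function keeps everything else in the request identical, so a timing difference can be attributed to the method alone.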




This is a perf test to see whether Solr can aggregate the 600M URLs with their unique users, and what the average response time would be (minutes are acceptable, but the lower the better).


—/Yago Riveiro

On Tue, Dec 22, 2015 at 3:27 PM, Yonik Seeley <ys...@gmail.com> wrote:

> On Tue, Dec 22, 2015 at 6:06 AM, Yago Riveiro <ya...@gmail.com> wrote:
>> I'm surprised by the speed difference between DV and stream: the same query (aggregating 7M unique keys) takes 21s with the stream method and about 3 minutes with DV.
> Wow - is this a "real" DV field, or one that was built on-demand in
> the FieldCache?  Were those times for the first request, or subsequent
> requests?
> What are the characteristics of that field... i.e. how many unique
> values in the shard (local index being queried) and how many typical
> values per field?
> And how many docs total on the shard?
> -Yonik

Re: Json facet api method stream

Posted by Yonik Seeley <ys...@gmail.com>.
On Tue, Dec 22, 2015 at 6:06 AM, Yago Riveiro <ya...@gmail.com> wrote:
> I'm surprised by the speed difference between DV and stream: the same query (aggregating 7M unique keys) takes 21s with the stream method and about 3 minutes with DV.

Wow - is this a "real" DV field, or one that was built on-demand in
the FieldCache?  Were those times for the first request, or subsequent
requests?
What are the characteristics of that field... i.e. how many unique
values in the shard (local index being queried) and how many typical
values per field?
And how many docs total on the shard?

-Yonik

Re: Json facet api method stream

Posted by Yago Riveiro <ya...@gmail.com>.
Ok,

I'm surprised by the speed difference between DV and stream: the same query (aggregating 7M unique keys) takes 21s with the stream method and about 3 minutes with DV.


—/Yago Riveiro

On Tue, Dec 22, 2015 at 1:46 AM, Yonik Seeley <ys...@gmail.com> wrote:

> On Mon, Dec 21, 2015 at 6:56 PM, Yago Riveiro <ya...@gmail.com> wrote:
>> Does the JSON Facet API method "stream" use docValues internally to do the
>> aggregation on the fly?
>>
>> I want to know whether using this method justifies having docValues configured
>> in the schema.
> It won't use docValues for the actual field being faceted on (because
> streaming in term order means that it's most efficient to use the term
> index and not docValues to find all of the docs that match a given
> term).
> It will use docValues for sub-facets/stats.
> -Yonik

Re: Json facet api method stream

Posted by Yonik Seeley <ys...@gmail.com>.
On Mon, Dec 21, 2015 at 6:56 PM, Yago Riveiro <ya...@gmail.com> wrote:
> Does the JSON Facet API method "stream" use docValues internally to do the
> aggregation on the fly?
>
> I want to know whether using this method justifies having docValues configured
> in the schema.

It won't use docValues for the actual field being faceted on (because
streaming in term order means that it's most efficient to use the term
index and not docValues to find all of the docs that match a given
term).

It will use docValues for sub-facets/stats.

-Yonik
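
The distinction described above can be pictured with a toy model. This is not Solr's implementation, only a sketch of the idea: buckets are produced one at a time by walking the faceted field's terms in sorted order and looking up the documents for each term, while the sub-facet statistic is computed from a per-document column of values, which is the role docValues play. All data and names are made up, and an exact set count stands in for `hll()`.

```python
from collections import defaultdict

# Toy index: (doc id, url, user_id). All values are invented.
docs = [
    (0, "a.example/x", 10),
    (1, "b.example/y", 11),
    (2, "a.example/x", 11),
    (3, "a.example/x", 10),
    (4, "c.example/z", 12),
]

# "Term index" for the faceted field: term -> set of matching doc ids.
postings = defaultdict(set)
for doc_id, url, _ in docs:
    postings[url].add(doc_id)

# Column of per-document values for the sub-facet field (docValues-like).
user_id_column = {doc_id: user_id for doc_id, _, user_id in docs}

def stream_terms_facet(base_docs):
    """Yield (term, count, unique_users) one bucket at a time, in term order."""
    for term in sorted(postings):
        # Docs for this term that are also in the query result.
        matched = postings[term] & base_docs
        if not matched:
            continue
        # Sub-facet stat read from the column; an exact count stands in for hll().
        unique_users = len({user_id_column[d] for d in matched})
        yield term, len(matched), unique_users

buckets = list(stream_terms_facet(base_docs={0, 1, 2, 3}))
for bucket in buckets:
    print(bucket)
```

Because each bucket is yielded as soon as its term has been processed, nothing in this model needs to hold all buckets in memory before the first one is emitted.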