Posted to solr-user@lucene.apache.org by Alessandro Benedetti <ab...@apache.org> on 2016/07/04 13:20:43 UTC

Re: How to speed up field collapsing on large number of groups

Have you tried docValues for the fields involved in the collapse group
head selection?

With docValues, group head selection by "min", "max" and "sort" should work
quite well.
Of course, it depends on your formula.

Does your index change often?
If the warming time is not a problem, you could try:

hint

Currently there is only one hint available "top_fc", which stands for top
level FieldCache. The top_fc hint is only available when collapsing on
String fields. top_fc provides the best query time speed but takes the
longest to warm on startup or following a commit. top_fc also will result
in having the collapsed field cached in memory twice if it's used for
faceting or sorting.
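
As a rough illustration of how the hint would be passed, here is a minimal Python sketch that assembles the collapse filter and URL-encodes the request parameters. The helper name collapse_filter and the field "price" are made up for the example; only groupId, max and hint=top_fc come from this thread and the documentation quoted above. It assumes a Solr version where the hint is honoured (Solr 5 or later).

```python
from urllib.parse import urlencode


def collapse_filter(field, max_func=None, hint=None):
    """Build a {!collapse ...} filter query from its local params.

    Hypothetical helper for illustration; it only concatenates strings
    and does not validate field names or function syntax.
    """
    parts = ["field=" + field]
    if max_func is not None:
        parts.append("max=" + max_func)
    if hint is not None:
        parts.append("hint=" + hint)
    return "{!collapse " + " ".join(parts) + "}"


# "groupId" is the collapse field from the thread; "price" is an assumed
# numeric field standing in for the real group head formula.
fq = collapse_filter("groupId", max_func="price", hint="top_fc")

# URL-encode the parameters so the braces and spaces survive the request.
params = urlencode({"q": "*:*", "fq": fq, "rows": 30})

print(fq)
print(params)
```

The encoded string can be appended to the select handler URL of the core being queried.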

Cheers

On Wed, Jun 29, 2016 at 1:59 AM, Jichi Guo <ji...@gmail.com> wrote:

> Thanks for the quick response, Joel!
>
> I am hoping to delay sharding if possible, which might involve more things
> to consider :)
>
> > 1) What is the size of the result set before the collapse?
>
> When search with q=*:* for example, before collapse numFound is around 5
> million, and that after collapse is 2 million.
>
> I only return about the top 30 documents in the result.
>
> > 2) Have you tested without the long formula, just using a field for the
> > min/max. It would be good to understand the impact of the formula on
> > performance.
>
> The performance seems to be affected by the number of fields appearing in
> the max formula.
>
> For example, that 5 million expensive query would take 4.4 sec.
>
> For both {!collapse field=productGroupId} and
> {!collapse field=productGroupId max=only_one_field}, the query time would
> reduce to around 2.4 sec.
>
> If I remove the entire collapse fq, then the query only took 1.3 sec.
>
> > 3) How much memory do you have on the server and for the heap. Memory use
> > rises with the cardinality of the collapse field. So you'll want to be sure
> > there is enough memory to comfortably perform the collapse.
>
> I am setting Xmx to 24G. The total index size on disk is 50G.
>
> In solrconfig.xml, I use solr.FastLRUCache for filterCache with cache size
> 2048, solr.LRUCache for documentCache with cache size 32768, and
> solr.LRUCache for queryResultCache with cache size 4096. I am using default
> fieldValueCache.
>
> I found Collapsing QParser plugin explicitly uses lucene's field cache.
>
> Maybe, increasing fieldCache would help?  But I am not sure how to
> increase it in Solr.
>
>
> On Jun 28 2016, at 4:48 pm, Joel Bernstein <joelsolr@gmail.com> wrote:
>
> > Sharding will help, but you'll need to co-locate documents by group ID. A
> > few questions / suggestions:
> >
> > 1) What is the size of the result set before the collapse?
> >
> > 2) Have you tested without the long formula, just using a field for the
> > min/max. It would be good to understand the impact of the formula on
> > performance.
> >
> > 3) How much memory do you have on the server and for the heap. Memory use
> > rises with the cardinality of the collapse field. So you'll want to be sure
> > there is enough memory to comfortably perform the collapse.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Tue, Jun 28, 2016 at 4:08 PM, jichi <jichifly@gmail.com> wrote:
> >
> > > Hi everyone,
> > >
> > > I am using Solr 4.10 to index 20 million documents without sharding.
> > > Each document has a groupId field, and there are about 2 million groups.
> > > I found the search with collapsing on groupId significantly slower
> > > comparing to without collapsing, especially when combined with facet
> > > queries.
> > >
> > > I am wondering what would be the general approach to speedup field
> > > collapsing by 2~4 times?
> > > Would sharding the index help?
> > > Is it possible to optimize collapsing without sharding?
> > >
> > > The filter parameter for collapsing is like this:
> > >
> > >     q=*:*&fq={!collapse field=groupId max=sum(...a long formula...)}
> > >
> > > I also put this fq into warmup queries xml to warmup caches. But still,
> > > when q changes and more fq are added, the collapsing search would take
> > > about 3~5 seconds. Without collapsing, the search can finish within 2
> > > seconds.
> > >
> > > I am thinking to manually optimize CollapsingQParserPlugin through
> > > parallelization or extra caching.
> > > For example, is it possible to parallelize collapsing collector by
> > > different lucene index segments?
> > >
> > > Thanks!
> > >
> > > --
> > > jichi


-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: How to speed up field collapsing on large number of groups

Posted by Joel Bernstein <jo...@gmail.com>.
The top_fc hint doesn't come into play until Solr 5. With Solr 4.x, the
CollapsingQParserPlugin always uses a top-level field cache.

Joel Bernstein
http://joelsolr.blogspot.com/
