Posted to solr-user@lucene.apache.org by Alessandro Benedetti <ab...@apache.org> on 2016/07/04 13:20:43 UTC
Re: How to speed up field collapsing on large number of groups
Have you tried docValues for the fields involved in the collapse group
head selection?
With a group head selection of "min", "max", or "sort", it should work quite well.
Of course, it depends on your formula.
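As a sketch, enabling docValues in schema.xml might look like the following (field names are illustrative, borrowed from the thread; a reindex is required after changing them):

```xml
<!-- Collapse field: docValues lets group-head selection read values
     in a column-oriented structure instead of the FieldCache -->
<field name="groupId" type="string" indexed="true" stored="true"
       docValues="true"/>
<!-- A hypothetical numeric field referenced by the min/max formula -->
<field name="popularity" type="tfloat" indexed="true" stored="false"
       docValues="true"/>
```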
Does your index change often ?
If the warming time is not a problem, you could try the "hint" parameter:
Currently there is only one hint available, "top_fc", which stands for
top-level FieldCache. The top_fc hint is only available when collapsing on
String fields. top_fc provides the best query-time speed but takes the
longest to warm on startup or following a commit. top_fc will also result
in the collapsed field being cached in memory twice if it's used for
faceting or sorting.
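For example, the hint would be added to the collapse filter like this (using the groupId field from the original question; the max formula is elided as in the source):

```
q=*:*&fq={!collapse field=groupId hint=top_fc max=sum(...a long formula...)}
```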
Cheers
On Wed, Jun 29, 2016 at 1:59 AM, Jichi Guo <ji...@gmail.com> wrote:
> Thanks for the quick response, Joel!
>
> I am hoping to delay sharding if possible, which might involve more things
> to
> consider :)
>
>
>
> 1) What is the size of the result set before the collapse?
>
>
>
> When searching with q=*:*, for example, numFound before collapse is around 5
> million, and after collapse it is 2 million.
>
> I only return about the top 30 documents in the result.
>
>
>
> 2) Have you tested without the long formula, just using a field for the
> min/max? It would be good to understand the impact of the formula on
> performance.
>
>
>
> The performance seems to be affected by the number of fields appearing in
> the
> max formula.
>
>
>
> For example, that expensive 5-million-document query takes 4.4 sec.
>
> For both {!collapse field=productGroupId} and {!collapse
> field=productGroupId
> max=only_one_field}, the query time reduces to around 2.4 sec.
>
> If I remove the entire collapse fq, the query takes only 1.3 sec.
>
>
>
> 3) How much memory do you have on the server and for the heap? Memory use
> rises with the cardinality of the collapse field. So you'll want to be sure
> there is enough memory to comfortably perform the collapse.
>
>
>
> I am setting Xmx to 24G. The total index size on disk is 50G.
>
> In solrconfig.xml, I use solr.FastLRUCache for filterCache with cache size
> 2048, solr.LRUCache for documentCache with cache size 32768, and
> solr.LRUCache
> for queryResultCache with cache size 4096. I am using default
> fieldValueCache.
>
>
>
> I found that the CollapsingQParserPlugin explicitly uses Lucene's field cache.
>
> Maybe increasing the fieldCache would help? But I am not sure how to
> configure that in Solr.
>
>
> Sent from Nylas N1, the extensible, open source mail client.
>
> On Jun 28 2016, at 4:48 pm, Joel Bernstein <joelsolr@gmail.com>
> wrote:
>
> > Sharding will help, but you'll need to co-locate documents by group ID. A
> > few questions / suggestions:
> >
> > 1) What is the size of the result set before the collapse?
> >
> > 2) Have you tested without the long formula, just using a field for the
> > min/max? It would be good to understand the impact of the formula on
> > performance.
> >
> > 3) How much memory do you have on the server and for the heap? Memory use
> > rises with the cardinality of the collapse field. So you'll want to be sure
> > there is enough memory to comfortably perform the collapse.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Tue, Jun 28, 2016 at 4:08 PM, jichi <jichifly@gmail.com> wrote:
> >
> > > Hi everyone,
> > >
> > > I am using Solr 4.10 to index 20 million documents without sharding.
> > > Each document has a groupId field, and there are about 2 million groups.
> > > I found search with collapsing on groupId significantly slower
> > > compared to without collapsing, especially when combined with facet
> > > queries.
> > >
> > > I am wondering what would be the general approach to speed up field
> > > collapsing by 2~4 times?
> > > Would sharding the index help?
> > > Is it possible to optimize collapsing without sharding?
> > >
> > > The filter parameter for collapsing is like this:
> > >
> > > q=*:*&fq={!collapse field=groupId max=sum(...a long formula...)}
> > >
> > > I also put this fq into the warmup queries XML to warm caches. But still,
> > > when q changes and more fq are added, the collapsing search takes
> > > about 3~5 seconds. Without collapsing, the search finishes within 2
> > > seconds.
> > >
> > > I am thinking of manually optimizing CollapsingQParserPlugin through
> > > parallelization or extra caching.
> > > For example, is it possible to parallelize the collapsing collector by
> > > different Lucene index segments?
> > >
> > > Thanks!
> > >
> > > --
> > > jichi
>
--
--------------------------
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti
"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"
William Blake - Songs of Experience -1794 England
Re: How to speed up field collapsing on large number of groups
Posted by Joel Bernstein <jo...@gmail.com>.
The top_fc hint doesn't come into play until Solr 5. With Solr 4.x, the
CollapsingQParserPlugin always uses a top-level field cache.
Joel Bernstein
http://joelsolr.blogspot.com/