Posted to solr-user@lucene.apache.org by Shamik Bandopadhyay <sh...@gmail.com> on 2015/01/16 00:02:05 UTC

Does DocValues improve Grouping performance ?

Hi,

   Does using DocValues provide any performance improvement for grouping?
I've looked at the blog post below, which mentions improving grouping
performance through DocValues.

https://lucidworks.com/blog/fun-with-docvalues-in-solr-4-2/

Right now, group-by queries (which I sadly can't avoid) have become a huge
bottleneck. They have an overhead of 60-70% compared to the same query sans
grouping. Unfortunately, I'm not able to use CollapsingQParserPlugin, as it
doesn't support anything similar to the "group.facet" feature.

My understanding of DocValues is that it's intended for faceting and
sorting. Just wondering if anyone has tried DocValues for grouping and seen
any improvements?
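
For anyone wanting to try this, here is a minimal sketch of a schema.xml
field definition with DocValues enabled, generated with Python's standard
library so it can be checked. The field name group_key is hypothetical; the
docValues attribute is set on the field definition, and existing documents
have to be reindexed after the change.

```python
import xml.etree.ElementTree as ET

# Hypothetical grouping field; only the docValues="true" attribute matters here.
field = ET.Element(
    "field",
    {
        "name": "group_key",
        "type": "string",
        "indexed": "true",
        "stored": "true",
        "docValues": "true",
    },
)

# Serialize the element so it can be pasted into schema.xml.
field_xml = ET.tostring(field, encoding="unicode")
print(field_xml)
```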

-Thanks,
Shamik

RE: Does DocValues improve Grouping performance ?

Posted by "Cario, Elaine" <El...@wolterskluwer.com>.
In our case, we have fewer than 20 distinct groups, and a typical search result returns about 10 of those groups (usually 3 documents per group). We use the default sort by score.  There are 12 million docs spread across 3 shards.  We set group.facet=false.  The wkcluster field is a string field indexed with DocValues, and every document has a value for it. Sample query:

?q=*%3A*&rows=100&wt=xml&indent=true&group=true&group.field=wkcluster&group.limit=3&hl=false&facet=false&group.facet=false

This query returned 18 groups and took about 1.7 seconds even after executing it a few times.
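
The same request can be sketched as a parameter map, which is easier to read and to vary than the raw URL. This assumes Python's standard library only; the /select handler path is an assumption, and the encoded form of q may differ cosmetically from the URL above while meaning the same thing.

```python
from urllib.parse import urlencode, parse_qs


def grouping_params(group_field, group_limit=3, rows=100):
    """Return Solr request parameters for a simple field-grouping query."""
    return {
        "q": "*:*",
        "rows": rows,
        "wt": "xml",
        "indent": "true",
        "group": "true",
        "group.field": group_field,
        "group.limit": group_limit,
        "hl": "false",
        "facet": "false",
        "group.facet": "false",
    }


# Build the query string for the wkcluster field used above.
query_string = urlencode(grouping_params("wkcluster"))
print("/select?" + query_string)
```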

The main drag we see is that there are 2 internal queries (on each shard) generated when we have group=true. They are essentially the same except for additional group.topgroups params in the 2nd query.  These queries seem to be done serially, so it almost doubles the latency.  I'm not sure if it's something we're doing (or not doing) in the query, or this is just the way it is.
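
A rough way to see why two serial passes nearly double the latency: if each pass has to wait for its slowest shard, the total is roughly the sum of the per-pass maxima. This is only a simplified model, and the per-shard timings below are made-up numbers, not measurements from our cluster.

```python
def estimated_latency_ms(phases):
    """Sum, over serial phases, the time of the slowest shard in each phase."""
    return sum(max(shard_times) for shard_times in phases)


# Hypothetical per-shard timings in milliseconds for 3 shards.
single_pass = [[400, 420, 450]]
two_pass = [[400, 420, 450], [380, 410, 440]]

print(estimated_latency_ms(single_pass))
print(estimated_latency_ms(two_pass))
```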

I don't think we can use the aforementioned block-join feature here, as it would be difficult for us to build document blocks based on the group (and there have been requirements to group on different fields).  Unfortunately, the grouping feature has been extremely popular in the production applications running on our search platform (we're migrating from FAST, where grouping performance was quite good).

We do have other performance issues (currently we are investigating an issue with a scale function) - we are hoping we can resolve those to the point where the double query for grouping isn't so noticeable.

-----Original Message-----
From: Joel Bernstein [mailto:joelsolr@gmail.com] 
Sent: Friday, January 30, 2015 6:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Does DocValues improve Grouping performance ?

A few questions so we can better understand the scale of grouping you're trying to accomplish:

How many distinct groups do you typically have in a search result?

How many distinct groups are there in the field you are grouping on?

How many results are you trying to group in a query?

Joel Bernstein
Search Engineer at Heliosearch

On Fri, Jan 30, 2015 at 4:10 PM, Cario, Elaine < Elaine.Cario@wolterskluwer.com> wrote:

> Hi Shamik,
>
> We use DocValues for grouping, and although I have nothing to compare 
> it to (we started with DocValues), we are also seeing similar poor 
> results as
> you: easily 60% overhead compared to non-group queries.  Looking 
> around for some solution, no quick fix is presenting itself unfortunately.
> CollapsingQParserPlugin also is too limited for our needs.

Re: Does DocValues improve Grouping performance ?

Posted by shamik <sh...@gmail.com>.
Joel,

  To give you some context, we are running queries against 6 million
documents in a SolrCloud environment. The grouping is done to de-duplicate
content based on a unique field. Unfortunately, due to some requirement
constraints, the only way for us to run the de-duplication is at query
time.

The group numbers are pretty high in our case. The average number of distinct
groups in a result is around 1000. The total number of distinct groups for the
field is around 10k. Phrase queries are especially bad, averaging a response
time of 10-12 secs. Having said that, CollapsingQParserPlugin makes a huge
difference in performance, the only caveat being the lack of support for a
"group.facet" equivalent. I had this discussion with you earlier, where you
confirmed it:

http://lucene.472066.n3.nabble.com/RE-SOLR-6143-Bad-facet-counts-from-CollapsingQParserPlugin-td4140455.html#a4146645

Are there any plans to address this? I'm not sure if it's a big change at your
end, but if there's something we can contribute to add it, I'm more than happy
to help. I know there are a bunch of people who are looking forward to this.
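
For context, this is roughly how the collapse variant of our de-duplication
query is put together: the collapse is applied as a filter query through the
{!collapse} local-params syntax. A minimal sketch, assuming a hypothetical
de-duplication field named dedup_id and a made-up phrase query; it only
builds the request parameters and does not address the "group.facet" gap.

```python
from urllib.parse import urlencode, parse_qs


def collapse_params(user_query, collapse_field, rows=10):
    """Return request parameters that collapse results on one field."""
    return {
        "q": user_query,
        # CollapsingQParserPlugin is invoked as a filter query.
        "fq": "{!collapse field=%s}" % collapse_field,
        "rows": rows,
    }


# dedup_id is a hypothetical field name used only for illustration.
params = collapse_params('"solr grouping"', "dedup_id")
query_string = urlencode(params)
print(query_string)
```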




Re: Does DocValues improve Grouping performance ?

Posted by Joel Bernstein <jo...@gmail.com>.
A few questions so we can better understand the scale of grouping you're
trying to accomplish:

How many distinct groups do you typically have in a search result?

How many distinct groups are there in the field you are grouping on?

How many results are you trying to group in a query?

Joel Bernstein
Search Engineer at Heliosearch

On Fri, Jan 30, 2015 at 4:10 PM, Cario, Elaine <
Elaine.Cario@wolterskluwer.com> wrote:

> Hi Shamik,
>
> We use DocValues for grouping, and although I have nothing to compare it
> to (we started with DocValues), we are also seeing similar poor results as
> you: easily 60% overhead compared to non-group queries.  Looking around for
> some solution, no quick fix is presenting itself unfortunately.
> CollapsingQParserPlugin also is too limited for our needs.

Re: Does DocValues improve Grouping performance ?

Posted by Kydryavtsev Andrey <we...@yandex.ru>.

31.01.2015, 23:23, "Michael Sokolov" <ms...@safaribooksonline.com>:
> On 1/31/2015 2:47 PM, Mikhail Khludnev wrote:
>>  Michael,
>>
>>  Please check two questions inlined below
>
> Hi Mikhail,
>>  On Sat, Jan 31, 2015 at 10:14 PM, Michael Sokolov <
>>  msokolov@safaribooksonline.com> wrote:
>>
>>  You can only handle a single relation this way since you have to
>>  restructure your index to use it; grouping is more flexible.
>>
>>  Michael,
>>  would you mind to comment which relations you need to model particularly?
>>  BJQ is definitely much restrictive than grouping, but still have some
>>  flexibility to cover the most frequent demands.
>
> This was really a theoretical comment only - in our case we only had a
> single relation (book->chapter), and the parent->child join worked out
> great.
>>  Would you mind to leave your vote
>>  https://issues.apache.org/jira/browse/SOLR-5662 it's not a big deal to
>>  implement.
>
> Sure, I just voted for the issue. In my case, I used the max score.
>
> -Mike

Re: Does DocValues improve Grouping performance ?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Oh my. Sorry, SOLR-5662 <https://issues.apache.org/jira/browse/SOLR-5662> is
a duplicate of SOLR-5882, where a patch exists. I'd appreciate it if one of
the committers could nail it down.

On Sat, Jan 31, 2015 at 11:22 PM, Michael Sokolov <
msokolov@safaribooksonline.com> wrote:

> On 1/31/2015 2:47 PM, Mikhail Khludnev wrote:
>
>> Michael,
>>
>> Please check two questions inlined below
>>
> Hi Mikhail,
>
>>
>> On Sat, Jan 31, 2015 at 10:14 PM, Michael Sokolov <
>> msokolov@safaribooksonline.com> wrote:
>>
>>
>> You can only handle a single relation this way since you have to
>> restructure your index to use it; grouping is more flexible.
>>
>> Michael,
>> would you mind to comment which relations you need to model particularly?
>> BJQ is definitely much restrictive than grouping, but still have some
>> flexibility to cover the most frequent demands.
>>
>>  This was really a theoretical comment only - in our case we only had a
> single relation (book->chapter), and the parent->child join worked out
> great.
>
>> Would you mind to leave your vote
>> https://issues.apache.org/jira/browse/SOLR-5662 it's not a big deal to
>> implement.
>>
>>  Sure, I just voted for the issue. In my case, I used the max score.
>
> -Mike
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: Does DocValues improve Grouping performance ?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 1/31/2015 2:47 PM, Mikhail Khludnev wrote:
> Michael,
>
> Please check two questions inlined below
Hi Mikhail,
>
> On Sat, Jan 31, 2015 at 10:14 PM, Michael Sokolov <
> msokolov@safaribooksonline.com> wrote:
>
>
> You can only handle a single relation this way since you have to
> restructure your index to use it; grouping is more flexible.
>
> Michael,
> would you mind to comment which relations you need to model particularly?
> BJQ is definitely much restrictive than grouping, but still have some
> flexibility to cover the most frequent demands.
>
This was really a theoretical comment only - in our case we only had a 
single relation (book->chapter), and the parent->child join worked out 
great.
> Would you mind to leave your vote
> https://issues.apache.org/jira/browse/SOLR-5662 it's not a big deal to
> implement.
>
Sure, I just voted for the issue. In my case, I used the max score.

-Mike

Re: Does DocValues improve Grouping performance ?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Michael,

Please see my two questions inline below

On Sat, Jan 31, 2015 at 10:14 PM, Michael Sokolov <
msokolov@safaribooksonline.com> wrote:

> We were using grouping (no DocValues, though) and recently switched to
> using block-indexing and joins (see https://cwiki.apache.org/
> confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers).
> We got a nice speedup on average (perhaps 2x faster) and an even better
> improvement in the worst times; overall the performance is much more
> predictable and better, and I suspect (haven't checked) that we may be
> using less heap too.  The block indexing is cutting edge, a little
> complicated to get right, and I had to make some custom java code to get
> things just the way I wanted, but for best performance it does seem to be
> the way to go.
>
> Beware some gotchas:
>
> You have to reindex all the docs that participate in the parent-child
> relation so that each parent-child block gets indexed at once.  This might
> cause difficulties, but for us and I suspect most people, it's the natural
> thing to do anyway.
>
> You can only handle a single relation this way since you have to
> restructure your index to use it; grouping is more flexible.
>
Michael,
would you mind commenting on which relations in particular you need to model?
BJQ is definitely much more restrictive than grouping, but it still has some
flexibility to cover the most frequent demands.


>
> Clients may not support the new block-indexing syntax (I think SolrJ has
> it, but the python client we were using did not);
>
> Converting an existing index requires special care; you basically have to
> delete all documents you are re-indexing
>
> The Solr query parsers don't support scoring the joined-from documents
> (child docs in the to-parent query, parent docs in the to-child query).
> This might not matter to you, but it was important for our use case
>
Would you mind leaving your vote on SOLR-5662
(https://issues.apache.org/jira/browse/SOLR-5662)? It's not a big deal to
implement.


> So there are some kinks still, but if you can make it work for you, it
> does seem to perform better than grouping.
>
> -Mike
>
>
> On 1/30/2015 4:10 PM, Cario, Elaine wrote:
>
>> Hi Shamik,
>>
>> We use DocValues for grouping, and although I have nothing to compare it
>> to (we started with DocValues), we are also seeing similar poor results as
>> you: easily 60% overhead compared to non-group queries.  Looking around for
>> some solution, no quick fix is presenting itself unfortunately.
>> CollapsingQParserPlugin also is too limited for our needs.


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: Does DocValues improve Grouping performance ?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
We were using grouping (no DocValues, though) and recently switched to
using block indexing and joins (see
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers).
We got a nice speedup on average (perhaps 2x faster) and an even better
improvement in the worst-case times; overall the performance is much more
predictable and better, and I suspect (haven't checked) that we may be
using less heap too.  Block indexing is cutting edge and a little
complicated to get right, and I had to write some custom Java code to get
things just the way I wanted, but for best performance it does seem to
be the way to go.

Beware some gotchas:

You have to reindex all the docs that participate in the parent-child 
relation so that each parent-child block gets indexed at once.  This 
might cause difficulties, but for us and I suspect most people, it's the 
natural thing to do anyway.

You can only handle a single relation this way since you have to 
restructure your index to use it; grouping is more flexible.

Clients may not support the new block-indexing syntax (I think SolrJ has
it, but the Python client we were using did not).

Converting an existing index requires special care; you basically have
to delete all the documents you are re-indexing.

The Solr query parsers don't support scoring the joined-from documents
(child docs in the to-parent query, parent docs in the to-child query).
This might not matter to you, but it was important for our use case.

So there are some kinks still, but if you can make it work for you, it 
does seem to perform better than grouping.

-Mike
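
To make the book->chapter case concrete, here is a minimal sketch of one
parent-child block and a to-parent query. The field names (doc_type, title,
text) and the sample values are hypothetical; the _childDocuments_ key is the
one used by Solr's JSON update format for nested documents, and {!parent
which=...} is the block-join to-parent parser syntax. The whole block has to
be sent in one update so that parent and children are indexed together.

```python
import json


def book_block(book_id, title, chapters):
    """Build one parent document with its children nested for block indexing."""
    return {
        "id": book_id,
        "doc_type": "book",
        "title": title,
        "_childDocuments_": [
            {"id": "%s-ch%d" % (book_id, n), "doc_type": "chapter", "text": text}
            for n, text in enumerate(chapters, start=1)
        ],
    }


def to_parent_query(parent_filter, child_query):
    """Return a block-join query matching parents of the matching children."""
    return '{!parent which="%s"}%s' % (parent_filter, child_query)


# Hypothetical sample data: one book with two chapters.
block = book_block("book1", "Solr in Practice", ["Grouping basics", "Block joins"])
payload = json.dumps([block], indent=2)
query = to_parent_query("doc_type:book", "text:joins")
print(payload)
print(query)
```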

On 1/30/2015 4:10 PM, Cario, Elaine wrote:
> Hi Shamik,
>
> We use DocValues for grouping, and although I have nothing to compare it to (we started with DocValues), we are also seeing similar poor results as you: easily 60% overhead compared to non-group queries.  Looking around for some solution, no quick fix is presenting itself unfortunately.  CollapsingQParserPlugin also is too limited for our needs.


RE: Does DocValues improve Grouping performance ?

Posted by "Cario, Elaine" <El...@wolterskluwer.com>.
Hi Shamik,

We use DocValues for grouping, and although I have nothing to compare it to (we started with DocValues), we are also seeing similar poor results as you: easily 60% overhead compared to non-group queries.  Looking around for some solution, no quick fix is presenting itself unfortunately.  CollapsingQParserPlugin also is too limited for our needs.
