You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-user@lucene.apache.org by David Anthony Troiano <dt...@basistech.com> on 2013/11/13 20:46:51 UTC

field collapsing performance in sharded environment

Hello,

I'm hitting a performance issue when using field collapsing in a
distributed Solr setup and I'm wondering if others have seen it and if
anyone has an idea to work around. it.

I'm using field collapsing to deduplicate documents that have the same near
duplicate hash value, and deduplicating at query time (as opposed to
filtering at index time) is a requirement.  I have a sharded setup with 10
cores (not SolrCloud), each having ~1000 documents each.  Of the 10k docs,
most have a unique near duplicate hash value, so there are about 10k unique
values for the field that I'm grouping on.  The grouping parameters that
I'm using are:

group=true
group.field=<near dupe hash field>
group.main=true

I'm attempting distributed queries (&shards=s1,s2,...,s10) where the only
difference is the absence or presence of these three grouping parameters
and I'm consistently seeing a marked difference in performance (as a
representative data point, 200ms latency without grouping and 1600ms with
grouping).  Interestingly, if I put all 10k docs on the same core and query
that core independently with and without grouping, I don't see much of a
latency difference, so the performance degradation seems to exist only in
the sharded setup.

Is there a known performance issue when field collapsing in a sharded setup
(perhaps only manifests when the grouping field has many unique values), or
have other people observed this?  Any ideas for a workaround?  Note that
docs in my sharded setup can only have the same signature if they're in the
same shard, so perhaps that can be used to boost perf, though I don't see
an exposed way to do so.

A follow-on question is whether we're likely to see the same issue if /
when we move to SolrCloud.

Thanks,
Dave

Re: field collapsing performance in sharded environment

Posted by Paul Masurel <pa...@gmail.com>.

That's not the way grouping is done.
On a first round all shards return their 10 best group (represented as
their 10 best grouping values).

As a result it's a three round thing instead of the two round for regular
search, so observing an increasing in latency is normal but not in the
realm of what you are seeing here.

Most probably it is due to the performance issue of TermAllGroupsCollector
which you can patch very easily.


On Thu, Nov 14, 2013 at 3:56 PM, Erick Erickson <er...@gmail.com>wrote:

> bq:   Of the 10k docs,
> most have a unique near duplicate hash value, so there are about 10k unique
> values for the field that I'm grouping on.
>
> I suspect (but don't know the grouping code well) that this is the issue.
> You're
> getting the top N groups, right? But in the general case, you can't insure
> that the
> topN from shard1 has any relation to the topN from shard2. So I _suspect_
> that
> the code returns all of the groups. Say that shard1 for group 5 has 3 docs,
> but
> for shard2 has 3,000 docs. Do get the true top N, you need to collate all
> the values
> from all the groups; you can't just return the top 10 groups from each
> shard and
> get correct counts.
>
> Since your group cardinality is about 10K/shard, you're pushing 10 packets
> each
> containing 10K entries back to the originating shard, which has to
> combine/sort
> them all to get the true top N. At least that's my theory.
>
> Your situation is special in that you say that your groups don't appear on
> more than
> one shard, so you'd probably have to write something that aborted this
> behavior and
> returned only the top N, if I'm right.
>
> But that begs the question of why you're doing this. What purpose is served
> by
> grouping on documents that probably only have 1 member?
>
> Best,
> Erick
>
>
> On Wed, Nov 13, 2013 at 2:46 PM, David Anthony Troiano <
> dtroiano@basistech.com> wrote:
>
> > Hello,
> >
> > I'm hitting a performance issue when using field collapsing in a
> > distributed Solr setup and I'm wondering if others have seen it and if
> > anyone has an idea to work around. it.
> >
> > I'm using field collapsing to deduplicate documents that have the same
> near
> > duplicate hash value, and deduplicating at query time (as opposed to
> > filtering at index time) is a requirement.  I have a sharded setup with
> 10
> > cores (not SolrCloud), each having ~1000 documents each.  Of the 10k
> docs,
> > most have a unique near duplicate hash value, so there are about 10k
> unique
> > values for the field that I'm grouping on.  The grouping parameters that
> > I'm using are:
> >
> > group=true
> > group.field=<near dupe hash field>
> > group.main=true
> >
> > I'm attempting distributed queries (&shards=s1,s2,...,s10) where the only
> > difference is the absence or presence of these three grouping parameters
> > and I'm consistently seeing a marked difference in performance (as a
> > representative data point, 200ms latency without grouping and 1600ms with
> > grouping).  Interestingly, if I put all 10k docs on the same core and
> query
> > that core independently with and without grouping, I don't see much of a
> > latency difference, so the performance degradation seems to exist only in
> > the sharded setup.
> >
> > Is there a known performance issue when field collapsing in a sharded
> setup
> > (perhaps only manifests when the grouping field has many unique values),
> or
> > have other people observed this?  Any ideas for a workaround?  Note that
> > docs in my sharded setup can only have the same signature if they're in
> the
> > same shard, so perhaps that can be used to boost perf, though I don't see
> > an exposed way to do so.
> >
> > A follow-on question is whether we're likely to see the same issue if /
> > when we move to SolrCloud.
> >
> > Thanks,
> > Dave
> >
>



-- 
______________________________________________

 Masurel Paul
 e-mail: paul.masurel@gmail.com

Re: field collapsing performance in sharded environment

Posted by Erick Erickson <er...@gmail.com>.

bq:   Of the 10k docs,
most have a unique near duplicate hash value, so there are about 10k unique
values for the field that I'm grouping on.

I suspect (but don't know the grouping code well) that this is the issue.
You're
getting the top N groups, right? But in the general case, you can't insure
that the
topN from shard1 has any relation to the topN from shard2. So I _suspect_
that
the code returns all of the groups. Say that shard1 for group 5 has 3 docs,
but
for shard2 has 3,000 docs. Do get the true top N, you need to collate all
the values
from all the groups; you can't just return the top 10 groups from each
shard and
get correct counts.

Since your group cardinality is about 10K/shard, you're pushing 10 packets
each
containing 10K entries back to the originating shard, which has to
combine/sort
them all to get the true top N. At least that's my theory.

Your situation is special in that you say that your groups don't appear on
more than
one shard, so you'd probably have to write something that aborted this
behavior and
returned only the top N, if I'm right.

But that begs the question of why you're doing this. What purpose is served
by
grouping on documents that probably only have 1 member?

Best,
Erick

On Wed, Nov 13, 2013 at 2:46 PM, David Anthony Troiano <
dtroiano@basistech.com> wrote:

> Hello,
>
> I'm hitting a performance issue when using field collapsing in a
> distributed Solr setup and I'm wondering if others have seen it and if
> anyone has an idea to work around. it.
>
> I'm using field collapsing to deduplicate documents that have the same near
> duplicate hash value, and deduplicating at query time (as opposed to
> filtering at index time) is a requirement.  I have a sharded setup with 10
> cores (not SolrCloud), each having ~1000 documents each.  Of the 10k docs,
> most have a unique near duplicate hash value, so there are about 10k unique
> values for the field that I'm grouping on.  The grouping parameters that
> I'm using are:
>
> group=true
> group.field=<near dupe hash field>
> group.main=true
>
> I'm attempting distributed queries (&shards=s1,s2,...,s10) where the only
> difference is the absence or presence of these three grouping parameters
> and I'm consistently seeing a marked difference in performance (as a
> representative data point, 200ms latency without grouping and 1600ms with
> grouping).  Interestingly, if I put all 10k docs on the same core and query
> that core independently with and without grouping, I don't see much of a
> latency difference, so the performance degradation seems to exist only in
> the sharded setup.
>
> Is there a known performance issue when field collapsing in a sharded setup
> (perhaps only manifests when the grouping field has many unique values), or
> have other people observed this?  Any ideas for a workaround?  Note that
> docs in my sharded setup can only have the same signature if they're in the
> same shard, so perhaps that can be used to boost perf, though I don't see
> an exposed way to do so.
>
> A follow-on question is whether we're likely to see the same issue if /
> when we move to SolrCloud.
>
> Thanks,
> Dave
>

Re: field collapsing performance in sharded environment

Posted by Otis Gospodnetic <ot...@gmail.com>.

Have a look at https://issues.apache.org/jira/browse/SOLR-5027 +
https://wiki.apache.org/solr/CollapsingQParserPlugin

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Nov 13, 2013 at 2:46 PM, David Anthony Troiano <
dtroiano@basistech.com> wrote:

> Hello,
>
> I'm hitting a performance issue when using field collapsing in a
> distributed Solr setup and I'm wondering if others have seen it and if
> anyone has an idea to work around. it.
>
> I'm using field collapsing to deduplicate documents that have the same near
> duplicate hash value, and deduplicating at query time (as opposed to
> filtering at index time) is a requirement.  I have a sharded setup with 10
> cores (not SolrCloud), each having ~1000 documents each.  Of the 10k docs,
> most have a unique near duplicate hash value, so there are about 10k unique
> values for the field that I'm grouping on.  The grouping parameters that
> I'm using are:
>
> group=true
> group.field=<near dupe hash field>
> group.main=true
>
> I'm attempting distributed queries (&shards=s1,s2,...,s10) where the only
> difference is the absence or presence of these three grouping parameters
> and I'm consistently seeing a marked difference in performance (as a
> representative data point, 200ms latency without grouping and 1600ms with
> grouping).  Interestingly, if I put all 10k docs on the same core and query
> that core independently with and without grouping, I don't see much of a
> latency difference, so the performance degradation seems to exist only in
> the sharded setup.
>
> Is there a known performance issue when field collapsing in a sharded setup
> (perhaps only manifests when the grouping field has many unique values), or
> have other people observed this?  Any ideas for a workaround?  Note that
> docs in my sharded setup can only have the same signature if they're in the
> same shard, so perhaps that can be used to boost perf, though I don't see
> an exposed way to do so.
>
> A follow-on question is whether we're likely to see the same issue if /
> when we move to SolrCloud.
>
> Thanks,
> Dave
>