Posted to solr-user@lucene.apache.org by "Bryant, Michael" <mi...@kcl.ac.uk> on 2017/02/09 11:58:29 UTC

Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

Hi all,

I'm converting my legacy facets to JSON facets and am seeing much better performance, especially with high cardinality facet fields. However, the one issue I can't seem to resolve is excessive memory usage (and OOM errors) when trying to simulate the effect of "group.facet" to sort facets according to a grouping field.

My situation, slightly simplified, is:

Solr 4.6.1

  *   Doc set: ~200,000 docs
  *   Grouping by item_id, an indexed, stored, single value string field with ~50,000 unique values, ~4 docs per item
  *   Faceting by person_id, an indexed, stored, multi-value string field with ~50,000 values (w/ a very skewed distribution)
  *   No docValues fields

Each document here is a description of an item, and there are several descriptions per item in multiple languages.

With legacy facets I use group.field=item_id and group.facet=true, which gives me facet counts with the number of items rather than descriptions, and correctly sorted by descending item count.
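
For reference, the legacy request boils down to roughly these
parameters (query, paging and anything not relevant here omitted):

    &group=true
    &group.field=item_id
    &group.facet=true
    &facet=true
    &facet.field=person_id
    &facet.sort=count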

With JSON facets I'm doing the equivalent like so:

&json.facet={
    "people": {
        "type": "terms",
        "field": "person_id",
        "facet": {
            "grouped_count": "unique(item_id)"
        },
        "sort": "grouped_count desc"
    }
}

This works, and is somewhat faster than legacy faceting, but it also produces a massive spike in memory usage when (and only when) the sort parameter is set to the aggregate field. A server that runs happily with a 512MB heap OOMs unless I give it a 4GB heap. With sort set to (the default) "count desc" there is no memory usage spike.

I would be curious if anyone has experienced this kind of memory usage when sorting JSON facets by stats and if there’s anything I can do to mitigate it. I’ve tried reindexing with docValues enabled on the relevant fields and it seems to make no difference in this respect.

Many thanks,
~Mike

RE: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

Posted by "Bryant, Michael" <mi...@kcl.ac.uk>.
Thanks for letting me know, Yonik - I'll watch this issue with interest.

BTW, I said Solr 4.6.1 in my original post - that should've been 6.4.1.

Cheers,
~Mike
________________________________________
From: Yonik Seeley [yseeley@gmail.com]
Sent: 10 February 2017 21:44
To: solr-user@lucene.apache.org
Subject: Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

FYI, I just opened https://issues.apache.org/jira/browse/SOLR-10122 for this
-Yonik

On Fri, Feb 10, 2017 at 4:32 PM, Yonik Seeley <ys...@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 6:58 AM, Bryant, Michael
> <mi...@kcl.ac.uk> wrote:
>> Hi all,
>>
>> I'm converting my legacy facets to JSON facets and am seeing much better performance, especially with high cardinality facet fields. However, the one issue I can't seem to resolve is excessive memory usage (and OOM errors) when trying to simulate the effect of "group.facet" to sort facets according to a grouping field.
>
> Yeah, I sort of expected this... but haven't gotten around to
> implementing something that takes less memory yet.
> If you're faceting on A and sorting by unique(B), then memory use is
> O(cardinality(A)*cardinality(B)).
> We can definitely do a lot better.
>
> -Yonik

Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

Posted by Yonik Seeley <ys...@gmail.com>.
FYI, I just opened https://issues.apache.org/jira/browse/SOLR-10122 for this
-Yonik

On Fri, Feb 10, 2017 at 4:32 PM, Yonik Seeley <ys...@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 6:58 AM, Bryant, Michael
> <mi...@kcl.ac.uk> wrote:
>> Hi all,
>>
>> I'm converting my legacy facets to JSON facets and am seeing much better performance, especially with high cardinality facet fields. However, the one issue I can't seem to resolve is excessive memory usage (and OOM errors) when trying to simulate the effect of "group.facet" to sort facets according to a grouping field.
>
> Yeah, I sort of expected this... but haven't gotten around to
> implementing something that takes less memory yet.
> If you're faceting on A and sorting by unique(B), then memory use is
> O(cardinality(A)*cardinality(B)).
> We can definitely do a lot better.
>
> -Yonik

Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

Posted by Yonik Seeley <ys...@gmail.com>.
On Thu, Feb 9, 2017 at 6:58 AM, Bryant, Michael
<mi...@kcl.ac.uk> wrote:
> Hi all,
>
> I'm converting my legacy facets to JSON facets and am seeing much better performance, especially with high cardinality facet fields. However, the one issue I can't seem to resolve is excessive memory usage (and OOM errors) when trying to simulate the effect of "group.facet" to sort facets according to a grouping field.

Yeah, I sort of expected this... but haven't gotten around to
implementing something that takes less memory yet.
If you're faceting on A and sorting by unique(B), then memory use is
O(cardinality(A)*cardinality(B)).
We can definitely do a lot better.
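
For scale, with the numbers from this thread (a rough worst-case
estimate, assuming one counter slot per person_id/item_id term pair;
the real accounting may well be sparser):

    cardinality(person_id) * cardinality(item_id)
      ~= 50,000 * 50,000 = 2.5 * 10^9 slots

Even a single byte per slot would be on the order of 2.5 GB, so it's
not surprising that a 512 MB heap falls over.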

-Yonik

Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

Posted by "Bryant, Michael" <mi...@kcl.ac.uk>.
Darn, spoke too soon. Field collapsing throws off my facet counts where facet fields differ within groups.
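
To make the mismatch concrete, here's a made-up example (not real data
from my index):

    Two descriptions of the same item:
      {"id": "d1", "item_id": "item-1", "person_id": ["p1"]}
      {"id": "d2", "item_id": "item-1", "person_id": ["p2"]}

    With fq={!collapse field=item_id} only one of d1/d2 survives, so the
    person_id facet counts either p1 or p2 but not both, whereas
    group.facet=true counts item-1 once under each of p1 and p2.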

Back to the drawing board. FWIW, I tried the hyperloglog aggregation for the JSON facet counts and it has the same issue as unique() when used as the facet sort parameter - while reasonably fast, it uses masses of memory.

Cheers,
~Mike

------
Mike Bryant

Research Associate
Department of Digital Humanities
King’s College London

On 10 Feb 2017, at 18:53, Bryant, Michael <mi...@kcl.ac.uk> wrote:

Hi Tom,

Well, the collapsing query parser is… a much better solution to my problems! Thanks for cluing me in to this - I love it when you can delete a load of hacks for something both simpler and faster.

Best,
~Mike


------
Mike Bryant

Research Associate
Department of Digital Humanities
King’s College London

On 10 Feb 2017, at 14:37, Tom Evans <te...@googlemail.com> wrote:

Hi Mike

Looks like you are trying to get a list of the distinct item ids in a
result set, ordered by the most frequent item ids?

Can you use collapsing qparser for this instead? Should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Every document with the same item_id would need to be on the same
shard for this to work, and I'm not sure whether you can actually get
the count of collapsed documents, if that's something you need.


Another option might be to use the hyperloglog function - hll() -
instead of unique(), which should give slightly better performance.

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael
<mi...@kcl.ac.uk> wrote:
Hi all,

I'm converting my legacy facets to JSON facets and am seeing much better performance, especially with high cardinality facet fields. However, the one issue I can't seem to resolve is excessive memory usage (and OOM errors) when trying to simulate the effect of "group.facet" to sort facets according to a grouping field.

My situation, slightly simplified, is:

Solr 4.6.1

*   Doc set: ~200,000 docs
*   Grouping by item_id, an indexed, stored, single value string field with ~50,000 unique values, ~4 docs per item
*   Faceting by person_id, an indexed, stored, multi-value string field with ~50,000 values (w/ a very skewed distribution)
*   No docValues fields

Each document here is a description of an item, and there are several descriptions per item in multiple languages.

With legacy facets I use group.field=item_id and group.facet=true, which gives me facet counts with the number of items rather than descriptions, and correctly sorted by descending item count.

With JSON facets I'm doing the equivalent like so:

&json.facet={
  "people": {
      "type": "terms",
      "field": "person_id",
      "facet": {
          "grouped_count": "unique(item_id)"
      },
      "sort": "grouped_count desc"
  }
}

This works, and is somewhat faster than legacy faceting, but it also produces a massive spike in memory usage when (and only when) the sort parameter is set to the aggregate field. A server that runs happily with a 512MB heap OOMs unless I give it a 4GB heap. With sort set to (the default) "count desc" there is no memory usage spike.

I would be curious if anyone has experienced this kind of memory usage when sorting JSON facets by stats and if there’s anything I can do to mitigate it. I’ve tried reindexing with docValues enabled on the relevant fields and it seems to make no difference in this respect.

Many thanks,
~Mike



Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

Posted by "Bryant, Michael" <mi...@kcl.ac.uk>.
Hi Tom,

Well, the collapsing query parser is… a much better solution to my problems! Thanks for cluing me in to this - I love it when you can delete a load of hacks for something both simpler and faster.

Best,
~Mike


------
Mike Bryant

Research Associate
Department of Digital Humanities
King’s College London

On 10 Feb 2017, at 14:37, Tom Evans <te...@googlemail.com> wrote:

Hi Mike

Looks like you are trying to get a list of the distinct item ids in a
result set, ordered by the most frequent item ids?

Can you use collapsing qparser for this instead? Should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Every document with the same item_id would need to be on the same
shard for this to work, and I'm not sure whether you can actually get
the count of collapsed documents, if that's something you need.


Another option might be to use the hyperloglog function - hll() -
instead of unique(), which should give slightly better performance.

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael
<mi...@kcl.ac.uk> wrote:
Hi all,

I'm converting my legacy facets to JSON facets and am seeing much better performance, especially with high cardinality facet fields. However, the one issue I can't seem to resolve is excessive memory usage (and OOM errors) when trying to simulate the effect of "group.facet" to sort facets according to a grouping field.

My situation, slightly simplified, is:

Solr 4.6.1

 *   Doc set: ~200,000 docs
 *   Grouping by item_id, an indexed, stored, single value string field with ~50,000 unique values, ~4 docs per item
 *   Faceting by person_id, an indexed, stored, multi-value string field with ~50,000 values (w/ a very skewed distribution)
 *   No docValues fields

Each document here is a description of an item, and there are several descriptions per item in multiple languages.

With legacy facets I use group.field=item_id and group.facet=true, which gives me facet counts with the number of items rather than descriptions, and correctly sorted by descending item count.

With JSON facets I'm doing the equivalent like so:

&json.facet={
   "people": {
       "type": "terms",
       "field": "person_id",
       "facet": {
           "grouped_count": "unique(item_id)"
       },
       "sort": "grouped_count desc"
   }
}

This works, and is somewhat faster than legacy faceting, but it also produces a massive spike in memory usage when (and only when) the sort parameter is set to the aggregate field. A server that runs happily with a 512MB heap OOMs unless I give it a 4GB heap. With sort set to (the default) "count desc" there is no memory usage spike.

I would be curious if anyone has experienced this kind of memory usage when sorting JSON facets by stats and if there’s anything I can do to mitigate it. I’ve tried reindexing with docValues enabled on the relevant fields and it seems to make no difference in this respect.

Many thanks,
~Mike


Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

Posted by Tom Evans <te...@googlemail.com>.
Hi Mike

Looks like you are trying to get a list of the distinct item ids in a
result set, ordered by the most frequent item ids?

Can you use collapsing qparser for this instead? Should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Every document with the same item_id would need to be on the same
shard for this to work, and I'm not sure whether you can actually get
the count of collapsed documents, if that's something you need.
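
Very roughly, the request would look something like the following (the
collection name is a placeholder, and you'd keep whatever query and
other parameters you already have):

    /solr/<collection>/select
        ?q=*:*
        &fq={!collapse field=item_id}
        &facet=true
        &facet.field=person_id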


Another option might be to use the hyperloglog function - hll() -
instead of unique(), which should give slightly better performance.
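
In the json.facet from your original mail that would just be a
one-word change, something like this (note that hll() gives
approximate rather than exact distinct counts):

    &json.facet={
        "people": {
            "type": "terms",
            "field": "person_id",
            "facet": {
                "grouped_count": "hll(item_id)"
            },
            "sort": "grouped_count desc"
        }
    }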

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael
<mi...@kcl.ac.uk> wrote:
> Hi all,
>
> I'm converting my legacy facets to JSON facets and am seeing much better performance, especially with high cardinality facet fields. However, the one issue I can't seem to resolve is excessive memory usage (and OOM errors) when trying to simulate the effect of "group.facet" to sort facets according to a grouping field.
>
> My situation, slightly simplified, is:
>
> Solr 4.6.1
>
>   *   Doc set: ~200,000 docs
>   *   Grouping by item_id, an indexed, stored, single value string field with ~50,000 unique values, ~4 docs per item
>   *   Faceting by person_id, an indexed, stored, multi-value string field with ~50,000 values (w/ a very skewed distribution)
>   *   No docValues fields
>
> Each document here is a description of an item, and there are several descriptions per item in multiple languages.
>
> With legacy facets I use group.field=item_id and group.facet=true, which gives me facet counts with the number of items rather than descriptions, and correctly sorted by descending item count.
>
> With JSON facets I'm doing the equivalent like so:
>
> &json.facet={
>     "people": {
>         "type": "terms",
>         "field": "person_id",
>         "facet": {
>             "grouped_count": "unique(item_id)"
>         },
>         "sort": "grouped_count desc"
>     }
> }
>
> This works, and is somewhat faster than legacy faceting, but it also produces a massive spike in memory usage when (and only when) the sort parameter is set to the aggregate field. A server that runs happily with a 512MB heap OOMs unless I give it a 4GB heap. With sort set to (the default) "count desc" there is no memory usage spike.
>
> I would be curious if anyone has experienced this kind of memory usage when sorting JSON facets by stats and if there’s anything I can do to mitigate it. I’ve tried reindexing with docValues enabled on the relevant fields and it seems to make no difference in this respect.
>
> Many thanks,
> ~Mike