Posted to solr-user@lucene.apache.org by Matteo Diarena <m....@volocom.it> on 2019/01/30 09:41:39 UTC

SolrCloud becomes unresponsive after huge pivot facet query

Dear all,
we have a SolrCloud cluster with the following setup:
                - 3 zookeeper nodes
                - 4 solr nodes with:
                               - 4 CPU
                               - 16GB RAM

Each Solr instance is configured as follows:
SOLR_JAVA_MEM="-Xms2g -Xmx8g"
SOLR_OPTS="$SOLR_OPTS -Dlucene.cms.override_spins=false -Dlucene.cms.override_core_count=4"

On the cluster we created a collection with 5 shards each with 2 replicas for a total of 10 replicas.
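For reference, the collection was created with the Collections API along these lines (the collection name here is just a placeholder):

.../admin/collections?action=CREATE&name=<collection>&numShards=5&replicationFactor=2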

The full index size is less than 2 GB and under normal usage the used heap space is between 200MB and 500MB.

Unfortunately, if we try to perform a query like this one:

.../select?q=*:*&fq=ActionType:MAILOPENING&facet=true&rows=0&facet.pivot=FIELD_ObjectId,FIELD_MailId&f.FIELD_ObjectId.facet.pivot.mincount=0&f.FIELD_ObjectId.facet.limit=-1&f.FIELD_ObjectId.facet.pivot.mincount=0&f.FIELD_ObjectId.facet.limit=-1

where FIELD_ObjectId and FIELD_MailId are high-cardinality fields, all the heap space is used and the entire Solr cluster becomes really slow and unresponsive.
The Solr instances are not killed and the heap space is never released, so the only way to get the cluster up again is to restart all the Solr instances.

I know that the problem is the query itself, but I'd like to know how I can avoid this kind of problem.
Is there a way to limit memory usage during query execution so that a single query cannot hang the whole cluster?

I tried disabling all caches and investigating a heap dump, but I didn't manage to find a good solution.
I also wondered whether the issue could be the very large search responses exchanged between shards. Is that possible?

The cluster is not in production yet, so I can easily run tests or collect any data that is needed.

Any suggestion is welcome.

Thanks a lot,
Matteo Diarena
Direttore Innovazione

Volo.com S.r.l. (www.volocom.it - volocom@pec.it)
Via Luigi Rizzo, 8/1 - 20151 MILANO
Via Leone XIII, 95 - 00165 ROMA

Tel +39 02 89453024 / +39 02 89453023
Fax +39 02 89453500
Mobile +39 345 2129244
m.diarena@volocom.it


Re: SolrCloud becomes unresponsive after huge pivot facet query

Posted by Matteo Diarena <m....@volocom.it>.
Hi Erick,
first of all, thanks a lot for your response!

I suppose that exactly what you describe as "GC hell" is happening in my case, because I see continuous GC cycles and Solr is not showing OOM errors.

I absolutely agree with you that this is a bad query, but I was wondering whether there is any setting (on the Solr side or the JVM side) to prevent a single query from tearing down an entire cluster with no chance of automatic recovery.
I will implement a check in our APIs based on the Luke request handler to prevent the execution of expensive queries.

Anyway, in my opinion it would be useful to have a parameter like timeAllowed that covers the whole query process, or even better something like a memoryAllowed parameter to kill a query that is too expensive in terms of time or memory consumption.
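For example, as far as I understand, adding timeAllowed to the query today only bounds some phases of the search, not the whole distributed request (the timeout value here is just an example):

.../select?q=*:*&fq=ActionType:MAILOPENING&facet=true&rows=0&facet.pivot=FIELD_ObjectId,FIELD_MailId&timeAllowed=5000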

I'll also look into the streaming capabilities a bit more; they seem really powerful.

Best regards,
Matteo Diarena
Direttore Innovazione

Volo.com S.r.l. (www.volocom.it - volocom@pec.it)
Via Luigi Rizzo, 8/1 - 20151 MILANO 
Via Leone XIII, 95 - 00165 ROMA

Tel +39 02 89453024 / +39 02 89453023
Fax +39 02 89453500
Mobile +39 345 2129244
m.diarena@volocom.it

-----Original Message-----
From: Erick Erickson <er...@gmail.com>
Sent: 30 January 2019 17:44
To: solr-user <so...@lucene.apache.org>
Subject: Re: SolrCloud becomes unresponsive after huge pivot facet query

My suggestion is "don't do that" ;).

Ok, seriously. Conceptually what you have is an N-dimensional matrix.
Each "dimension" is one of your pivot fields, with one cell for each
unique value in the field. So the size is (cardinality of field 1) x
(cardinality of field 2) x (cardinality of field 3) ...

To make matters worse, the results from each shard need to be aggregated, so you're correct that you're shoving potentially a _very_ large set of data across your network that then has to be sorted into the final packet.

You don't indicate that you have OOM errors so what I suspect is happening is you're in "GC hell". Each GC cycle recovers just enough memory to continue for a very short time, then stops for another GC cycle. Rinse, repeat. Timeout.

Some more concrete suggestions:
1> You can use the "Luke request handler" to find the cardinality of
     the fields and then have a blacklist of fields so you wind up
     rejecting these queries up front.
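     As a rough sketch (the exact fields in the response vary by
     version, so check your own Luke output), a per-core request like
     this returns per-field term information you can use to estimate
     cardinality before accepting a pivot:

     .../<core>/admin/luke?fl=FIELD_ObjectId&numTerms=1&wt=json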

2> Consider the streaming capabilities. "Rollup" can be used for
     high-cardinality fields.
     See: https://lucene.apache.org/solr/guide/6_6/stream-decorators.html
     NOTE: "Facet" streams push the faceting down to the replicas, which
     you don't want to use in this case as it'll be the same problem.
     The facet streams are faster when they can be used, but I doubt you
     can in your case. BTW, as chance would have it, Joel B. just
     explained this to me ;).
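     For your two fields, a rollup over an /export stream might look
     something like this (untested sketch; note that /export requires
     docValues on the fields, and the sort must lead with the "over"
     fields):

     rollup(
       search(yourCollection,
              q="ActionType:MAILOPENING",
              qt="/export",
              fl="FIELD_ObjectId,FIELD_MailId",
              sort="FIELD_ObjectId asc,FIELD_MailId asc"),
       over="FIELD_ObjectId,FIELD_MailId",
       count(*))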

Best,
Erick


Re: SolrCloud becomes unresponsive after huge pivot facet query

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/31/2019 12:11 PM, Ruchir Choudhry wrote:
> I want to start working on Solr bugs; I would appreciate it if you or
> someone could point me to some minor ones.

It doesn't work like that.  Issues are not handed out, it's a strictly 
volunteer system.

You'll need to find the issues you want to work on.  It's a good idea to 
check the issue to see if somebody else might be already working on it, 
and if not, add a comment saying you intend to work on it.

There are some labels that we try to give to issues that we think a new 
developer could tackle.  The following search URL should show you open 
issues tagged this way, sorted so the oldest ones are first:

https://issues.apache.org/jira/issues/?jql=project%20%3D%20SOLR%20AND%20labels%20in%20(newdev%2Cbeginner%2Cbeginners)%20AND%20status%20in%20(Open%2CReopened)%20ORDER%20BY%20key%20ASC%20

It's a big URL, and it might get mangled by either my mail client or the 
mailing list.  If that happens, hopefully you can reconstruct it.

You can also do manual searches for topics that interest you.

Thanks,
Shawn

Re: SolrCloud becomes unresponsive after huge pivot facet query

Posted by Ruchir Choudhry <ru...@gmail.com>.
Hello Erick,

I want to start working on Solr bugs; I would appreciate it if you or
someone could point me to some minor ones.

Warm Regards,
Ruchir


Re: SolrCloud becomes unresponsive after huge pivot facet query

Posted by Erick Erickson <er...@gmail.com>.