Posted to solr-user@lucene.apache.org by Susheel Kumar <su...@gmail.com> on 2017/12/18 14:07:19 UTC

OOM spreads to other replica's/HA when OOM

Hello,

I was testing Solr to see whether a query that causes an OOM would limit
the impact to only the replica set that gets hit first.

But the behavior I see is that after the first set of replicas goes down
due to OOM (shown as gone in the cloud view), the other replicas start
going down as well. I have 6 shards in total, each shard having 2 replicas,
all on separate machines.

The expected behavior is that only the replica of each shard which gets hit
first should go down due to OOM, and the other replicas should survive and
provide high availability.

The setup I am testing with is Solr 6.0, and I am wondering whether this
would remain the same with 6.6, or whether there have been known
improvements to avoid spreading the OOM to the second/third set of replicas
and taking down the whole cluster.

Any info on this is appreciated.

Thanks,
Susheel

Re: OOM spreads to other replica's/HA when OOM

Posted by David Hastings <ha...@gmail.com>.
We put nginx servers in front of our three standalone Solr servers and our
three-node Galera cluster. It works very well, and the amount of control it
gives you is really helpful.

On Tue, Dec 19, 2017 at 10:58 AM, Walter Underwood <wu...@wunderwood.org>
wrote:

> > On Dec 19, 2017, at 7:38 AM, Toke Eskildsen <to...@kb.dk> wrote:
> >
> > Let's say we change Solr, so that it does not re-issue queries that
> > caused nodes to fail. Unfortunately that does not solve your problem as
> > the user will do what users do on an internal server error: Press
> > reload.
>
> That would work, because good load balancers can shed excess load. Amazon
> does not offer good load balancers, but I was using this feature ten years
> ago with others.
>
> I think we’ll be putting nginx in front of every Solr instance, listening
> on a different port, and limiting traffic with that.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>
>

Re: OOM spreads to other replica's/HA when OOM

Posted by Walter Underwood <wu...@wunderwood.org>.
> On Dec 19, 2017, at 7:38 AM, Toke Eskildsen <to...@kb.dk> wrote:
> 
> Let's say we change Solr, so that it does not re-issue queries that
> caused nodes to fail. Unfortunately that does not solve your problem as
> the user will do what users do on an internal server error: Press
> reload.

That would work, because good load balancers can shed excess load. Amazon does not offer good load balancers, but I was using this feature ten years ago with others.

I think we’ll be putting nginx in front of every Solr instance, listening on a different port, and limiting traffic with that.
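
As a rough sketch of the idea (not a tested config - the ports and the rate
numbers are only placeholders), per-IP request limiting in nginx looks
something like this:

    # included inside the http {} context; values are illustrative
    limit_req_zone $binary_remote_addr zone=solr_per_ip:10m rate=10r/s;

    server {
        listen 8984;                  # front-side port; Solr assumed on 8983

        location / {
            # queue a small burst, reject the rest with 503 instead of
            # letting it pile up on Solr
            limit_req zone=solr_per_ip burst=20 nodelay;
            proxy_pass http://127.0.0.1:8983;
        }
    }

The point is that the 503s are served by nginx, so a flood of reloads never
reaches the Solr heap.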

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: OOM spreads to other replica's/HA when OOM

Posted by Toke Eskildsen <to...@kb.dk>.
On Mon, 2017-12-18 at 15:56 -0500, Susheel Kumar wrote:
> Technically I agree Shawn with you on fixing OOME cause, Infact it is
> not an issue any more but I was testing for HA when planing for any
> failures.
> Same time it's hard to convince Business folks that HA wouldn't be
> there in case of OOME.

Let's say we change Solr, so that it does not re-issue queries that
caused nodes to fail. Unfortunately that does not solve your problem as
the user will do what users do on an internal server error: Press
reload.

So for a mechanism like that to work, the Solr cloud would have to maintain
a blacklist of queries that cause nodes to fail. But if it is paging-related,
the user might try pressing "next" instead, and then the query will be
different from the previous one but still cause an OOM. So maybe a mechanism
for detecting multiple OOM-triggering queries from the same user and then
blacklisting the user? But what if the query is a link shared on a forum?
And so forth.

Hardening by blacklisting is a game that is hard to win. So to
paraphrase Shawn: Make sure your users cannot issue OOMing queries.

- Toke Eskildsen, Royal Danish Library - Aarhus


Re: OOM spreads to other replica's/HA when OOM

Posted by Emir Arnautović <em...@sematext.com>.
Hi Susheel,
If a single query can cause a node to fail, and if a retry causes other replicas to be affected (still to be confirmed), then preventing the retry logic on the Solr side can only partially solve the issue - retry logic can also exist on the client side, and it will produce the same replica OOMs. Again, I am not sure whether Solr retries (SolrJ does, and I would expect the same code base to be used within Solr as well) or under what conditions, but maybe using a shorter timeAllowed would help you in some cases. Also, using preferLocalShards might confine the OOM to the aggregating node, but that could in turn trigger a client retry.

I agree with Shawn that the only true solution is to protect Solr from OOM - e.g. control the maximum start and rows.
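
As a rough illustration of that last point - a minimal client-side sketch, assuming SolrJ and with made-up limits, since as far as I know Solr 6.x does not cap start/rows on its own:

    import org.apache.solr.client.solrj.SolrQuery;

    /** Minimal sketch: clamp paging parameters before a query ever reaches Solr. */
    public class SafeQueryFactory {

        // Illustrative limits - tune them to your own heap and document sizes.
        private static final int MAX_ROWS = 100;
        private static final int MAX_START = 10_000;

        public static SolrQuery safeQuery(String queryString, int start, int rows) {
            SolrQuery q = new SolrQuery(queryString);
            q.setStart(Math.min(Math.max(start, 0), MAX_START)); // cap deep paging
            q.setRows(Math.min(Math.max(rows, 0), MAX_ROWS));    // cap page size
            return q;
        }
    }

For genuinely deep result sets, cursorMark-based paging is usually a safer route than large start values.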

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Dec 2017, at 21:56, Susheel Kumar <su...@gmail.com> wrote:
> 
> Technically I agree Shawn with you on fixing OOME cause, Infact it is not
> an issue any more but I was testing for HA when planing for any failures.
> Same time it's hard to convince Business folks that HA wouldn't be there in
> case of OOME.
> 
> I think the best option is to enable timeAllowed for now.
> 
> Thanks,
> Susheel
> 
> On Mon, Dec 18, 2017 at 11:37 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> 
>> On 12/18/2017 9:01 AM, Susheel Kumar wrote:
>>> Any thoughts on how one can provide HA in these situations.
>> 
>> As I have said already a couple of times today on other threads, there
>> are *exactly* two ways to deal with OOME.  No other solution is possible.
>> 
>> 1) Configure the system to allow the process to access more of the
>> resource that it's running out of.  This is typically the solution that
>> people will utilize.  In your case, you would need to make the heap larger.
>> 
>> 2) Change the configuration or the environment so fewer resources are
>> required.
>> 
>> OOME is special.  It is a problem that all the high availability steps
>> in the world cannot protect you from, for precisely the reasons that
>> Emir and I have described.  You must ensure that Solr is set up so there
>> are enough resources that OOME cannot occur.
>> 
>> I can see a general argument for making it possible to configure or
>> disable any retry mechanism in SolrCloud, but that is not the solution
>> here.  It would most likely only *delay* the problem to a later query.
>> The OOME itself must be fixed, using one of the two solutions already
>> outlined.
>> 
>> Thanks,
>> Shawn
>> 
>> 


Re: OOM spreads to other replica's/HA when OOM

Posted by Bojan Vukojevic <em...@gmail.com>.
UNSUBSCRIBE

On Mon, Dec 18, 2017 at 12:57 PM Susheel Kumar <su...@gmail.com>
wrote:

> Technically I agree Shawn with you on fixing OOME cause, Infact it is not
> an issue any more but I was testing for HA when planing for any failures.
> Same time it's hard to convince Business folks that HA wouldn't be there in
> case of OOME.
>
> I think the best option is to enable timeAllowed for now.
>
> Thanks,
> Susheel
>
> On Mon, Dec 18, 2017 at 11:37 AM, Shawn Heisey <ap...@elyograg.org>
> wrote:
>
> > On 12/18/2017 9:01 AM, Susheel Kumar wrote:
> > > Any thoughts on how one can provide HA in these situations.
> >
> > As I have said already a couple of times today on other threads, there
> > are *exactly* two ways to deal with OOME.  No other solution is possible.
> >
> > 1) Configure the system to allow the process to access more of the
> > resource that it's running out of.  This is typically the solution that
> > people will utilize.  In your case, you would need to make the heap
> larger.
> >
> > 2) Change the configuration or the environment so fewer resources are
> > required.
> >
> > OOME is special.  It is a problem that all the high availability steps
> > in the world cannot protect you from, for precisely the reasons that
> > Emir and I have described.  You must ensure that Solr is set up so there
> > are enough resources that OOME cannot occur.
> >
> > I can see a general argument for making it possible to configure or
> > disable any retry mechanism in SolrCloud, but that is not the solution
> > here.  It would most likely only *delay* the problem to a later query.
> > The OOME itself must be fixed, using one of the two solutions already
> > outlined.
> >
> > Thanks,
> > Shawn
> >
> >
>

Re: OOM spreads to other replica's/HA when OOM

Posted by Susheel Kumar <su...@gmail.com>.
Technically, Shawn, I agree with you on fixing the OOME cause. In fact it is
not an issue any more, but I was testing for HA when planning for failures.
At the same time, it's hard to convince business folks that HA wouldn't be
there in the case of an OOME.

I think the best option is to enable timeAllowed for now.
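
For what it's worth, here is a minimal SolrJ sketch of what I mean (the
ZooKeeper hosts, collection name and the 5-second value are placeholders,
and timeAllowed only bounds parts of query execution, so it is a mitigation
rather than a guarantee; assumes a recent 6.x SolrJ):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TimeAllowedExample {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                    .build()) {
                client.setDefaultCollection("mycollection");

                SolrQuery q = new SolrQuery("*:*");
                q.setTimeAllowed(5000);  // give up searching after ~5 seconds
                QueryResponse rsp = client.query(q);

                // If the limit kicked in, the response header carries
                // partialResults=true and the result set is incomplete.
                System.out.println(rsp.getResponseHeader().get("partialResults"));
            }
        }
    }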

Thanks,
Susheel

On Mon, Dec 18, 2017 at 11:37 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 12/18/2017 9:01 AM, Susheel Kumar wrote:
> > Any thoughts on how one can provide HA in these situations.
>
> As I have said already a couple of times today on other threads, there
> are *exactly* two ways to deal with OOME.  No other solution is possible.
>
> 1) Configure the system to allow the process to access more of the
> resource that it's running out of.  This is typically the solution that
> people will utilize.  In your case, you would need to make the heap larger.
>
> 2) Change the configuration or the environment so fewer resources are
> required.
>
> OOME is special.  It is a problem that all the high availability steps
> in the world cannot protect you from, for precisely the reasons that
> Emir and I have described.  You must ensure that Solr is set up so there
> are enough resources that OOME cannot occur.
>
> I can see a general argument for making it possible to configure or
> disable any retry mechanism in SolrCloud, but that is not the solution
> here.  It would most likely only *delay* the problem to a later query.
> The OOME itself must be fixed, using one of the two solutions already
> outlined.
>
> Thanks,
> Shawn
>
>

Re: OOM spreads to other replica's/HA when OOM

Posted by Shawn Heisey <ap...@elyograg.org>.
On 12/18/2017 9:01 AM, Susheel Kumar wrote:
> Any thoughts on how one can provide HA in these situations.

As I have said already a couple of times today on other threads, there
are *exactly* two ways to deal with OOME.  No other solution is possible.

1) Configure the system to allow the process to access more of the
resource that it's running out of.  This is typically the solution that
people will utilize.  In your case, you would need to make the heap larger.

2) Change the configuration or the environment so fewer resources are
required.

OOME is special.  It is a problem that all the high availability steps
in the world cannot protect you from, for precisely the reasons that
Emir and I have described.  You must ensure that Solr is set up so there
are enough resources that OOME cannot occur.

I can see a general argument for making it possible to configure or
disable any retry mechanism in SolrCloud, but that is not the solution
here.  It would most likely only *delay* the problem to a later query. 
The OOME itself must be fixed, using one of the two solutions already
outlined.
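
For option 1, the usual place to raise the heap is where Solr is started; a
minimal sketch (the 16g figure is only an example, and a larger heap brings
its own GC-pause trade-offs):

    # bin/solr.in.sh (Solr 6.x) - illustrative value only
    SOLR_HEAP="16g"

    # or, when starting by hand:
    bin/solr start -c -m 16g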

Thanks,
Shawn


Re: OOM spreads to other replica's/HA when OOM

Posted by Susheel Kumar <su...@gmail.com>.
Shawn/Emir - it's the Java heap space issue. In GCViewer I can see a sudden
jump in heap utilization, then Full GC lines, and finally the OOM killer
script killing Solr.

What I wonder is: if a retry from the coordinating node is what causes this
OOM query to spread to the next set of replicas, how can we tune or change
that behavior? Otherwise, even with a replication factor > 1, HA is still
not guaranteed in this situation, which defeats the purpose...

If we can't control this retry by the coordinating node, then I would say we
have something fundamentally wrong. I know "timeAllowed" may save us in some
of these scenarios, but if the OOM happens before "timeAllowed" plus the
extra time it takes to really kill the query, we still have the issue.

Any thoughts on how one can provide HA in these situations.

Thanks,
Susheel



On Mon, Dec 18, 2017 at 9:53 AM, Emir Arnautović <
emir.arnautovic@sematext.com> wrote:

> Ah, I misunderstood your usecase - it is not node that receives query that
> OOMs but nodes that are included in distributed queries are the one that
> OOMs. I would also say that it is expected because queries to particular
> shards fails and coordinating node retries using other replicas causing all
> replicas to fail. I did not check the code, but I would expect to have some
> sort of retry mechanism in place.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 18 Dec 2017, at 15:36, Susheel Kumar <su...@gmail.com> wrote:
> >
> > Yes, Emir.  If I repeat the query, it will spread to other nodes but
> that's
> > not the case.  This is my test env and i am deliberately executing the
> > query with very high offset and wildcard to cause OOM but executing only
> > one time.
> >
> > So it shouldn't spread to other replica sets and at the end of my test,
> > the first 6 shard/replica set's which gets hit should go down while
> other 6
> > should survive but that's not what I see at the end.
> >
> > Setup :  400+ million docs, JVM is 12GB.  Yes, only one collection. Total
> > 12 machines with 6 shards and 6 replica's (replicationFactor = 2)
> >
> > On Mon, Dec 18, 2017 at 9:22 AM, Emir Arnautović <
> > emir.arnautovic@sematext.com> wrote:
> >
> >> Hi Susheel,
> >> The fact that only node that received query OOM tells that it is about
> >> merging results from all shards and providing final result. It is
> expected
> >> that repeating the same query on some other node will result in a
> similar
> >> behaviour - it just mean that Solr does not have enough memory to
> execute
> >> this heavy query.
> >> Can you share more details on your test: size of collection, type of
> >> query, expected number of results, JVM settings, is that the only
> >> collection on cluster etc.
> >>
> >> Thanks,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 18 Dec 2017, at 15:07, Susheel Kumar <su...@gmail.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> I was testing Solr to see if a query which would cause OOM and would
> >> limit
> >>> the OOM issue to only the replica set's which gets hit first.
> >>>
> >>> But the behavior I see that after all set of first replica's went down
> >> due
> >>> to OOM (gone on cloud view) other replica's starts also getting down.
> >> Total
> >>> 6 shards I have with each shard having 2 replica's and on separate
> >> machines
> >>>
> >>> The expected behavior is that all shards replica which gets hit first
> >>> should go down due to OOM and then other replica's should survive and
> >>> provide High Availability.
> >>>
> >>> The setup I am testing with is Solr 6.0 and wondering if this is would
> >>> remain same with 6.6 or there has been some known improvements made to
> >>> avoid spreading OOM to second/third set of replica's and causing whole
> >>> cluster to down.
> >>>
> >>> Any info on this is appreciated.
> >>>
> >>> Thanks,
> >>> Susheel
> >>
> >>
>
>

Re: OOM spreads to other replica's/HA when OOM

Posted by Emir Arnautović <em...@sematext.com>.
Ah, I misunderstood your use case - it is not the node that receives the query that OOMs, but the nodes that are involved in the distributed query. I would also say that this is expected: the queries to particular shards fail, and the coordinating node retries them on the other replicas, causing all replicas to fail. I did not check the code, but I would expect some sort of retry mechanism to be in place.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Dec 2017, at 15:36, Susheel Kumar <su...@gmail.com> wrote:
> 
> Yes, Emir.  If I repeat the query, it will spread to other nodes but that's
> not the case.  This is my test env and i am deliberately executing the
> query with very high offset and wildcard to cause OOM but executing only
> one time.
> 
> So it shouldn't spread to other replica sets and at the end of my test,
> the first 6 shard/replica set's which gets hit should go down while other 6
> should survive but that's not what I see at the end.
> 
> Setup :  400+ million docs, JVM is 12GB.  Yes, only one collection. Total
> 12 machines with 6 shards and 6 replica's (replicationFactor = 2)
> 
> On Mon, Dec 18, 2017 at 9:22 AM, Emir Arnautović <
> emir.arnautovic@sematext.com> wrote:
> 
>> Hi Susheel,
>> The fact that only node that received query OOM tells that it is about
>> merging results from all shards and providing final result. It is expected
>> that repeating the same query on some other node will result in a similar
>> behaviour - it just mean that Solr does not have enough memory to execute
>> this heavy query.
>> Can you share more details on your test: size of collection, type of
>> query, expected number of results, JVM settings, is that the only
>> collection on cluster etc.
>> 
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 18 Dec 2017, at 15:07, Susheel Kumar <su...@gmail.com> wrote:
>>> 
>>> Hello,
>>> 
>>> I was testing Solr to see if a query which would cause OOM and would
>> limit
>>> the OOM issue to only the replica set's which gets hit first.
>>> 
>>> But the behavior I see that after all set of first replica's went down
>> due
>>> to OOM (gone on cloud view) other replica's starts also getting down.
>> Total
>>> 6 shards I have with each shard having 2 replica's and on separate
>> machines
>>> 
>>> The expected behavior is that all shards replica which gets hit first
>>> should go down due to OOM and then other replica's should survive and
>>> provide High Availability.
>>> 
>>> The setup I am testing with is Solr 6.0 and wondering if this is would
>>> remain same with 6.6 or there has been some known improvements made to
>>> avoid spreading OOM to second/third set of replica's and causing whole
>>> cluster to down.
>>> 
>>> Any info on this is appreciated.
>>> 
>>> Thanks,
>>> Susheel
>> 
>> 


Re: OOM spreads to other replica's/HA when OOM

Posted by Shawn Heisey <ap...@elyograg.org>.
On 12/18/2017 7:36 AM, Susheel Kumar wrote:
> Yes, Emir.  If I repeat the query, it will spread to other nodes but that's
> not the case.  This is my test env and i am deliberately executing the
> query with very high offset and wildcard to cause OOM but executing only
> one time.
>
> So it shouldn't spread to other replica sets and at the end of my test,
> the first 6 shard/replica set's which gets hit should go down while other 6
> should survive but that's not what I see at the end.
>
> Setup :  400+ million docs, JVM is 12GB.  Yes, only one collection. Total
> 12 machines with 6 shards and 6 replica's (replicationFactor = 2)

Do you know what the exact OOME you are encountering is? Is it "java 
heap space" or something else?

While ordinarily I would expect multiple replicas in SolrCloud to ensure 
high availability, OutOfMemoryError is a special class of problem.  When 
you encounter OOME on one server, it is likely that other similarly 
equipped/configured servers in the cloud will *also* encounter the 
error, and if that happens, you're not going to have high availability.

To eliminate the problem, you're going to have to figure out which 
resource is being depleted and either increase that resource or change 
things so that Solr doesn't need as much of that resource.

Thanks,
Shawn


Re: OOM spreads to other replica's/HA when OOM

Posted by Susheel Kumar <su...@gmail.com>.
Yes, Emir.  If I repeated the query, it would spread to other nodes, but
that's not the case.  This is my test environment, and I am deliberately
executing a query with a very high offset and a wildcard to cause an OOM,
but executing it only one time.

So it shouldn't spread to the other replica sets, and at the end of my test
the first 6 shard replicas that get hit should go down while the other 6
should survive - but that's not what I see at the end.

Setup: 400+ million docs, JVM heap is 12GB. Yes, only one collection. 12
machines in total, with 6 shards and 6 replicas (replicationFactor = 2).
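
To make it concrete, the request is along these lines (the field name and
the numbers here are only representative, not the exact query):

    http://host:8983/solr/mycollection/select?q=field_s:*foo*&start=10000000&rows=100

With a start that deep, each shard has to collect start+rows documents and
the coordinating node has to merge all of them, which is what exhausts the
heap.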

On Mon, Dec 18, 2017 at 9:22 AM, Emir Arnautović <
emir.arnautovic@sematext.com> wrote:

> Hi Susheel,
> The fact that only node that received query OOM tells that it is about
> merging results from all shards and providing final result. It is expected
> that repeating the same query on some other node will result in a similar
> behaviour - it just mean that Solr does not have enough memory to execute
> this heavy query.
> Can you share more details on your test: size of collection, type of
> query, expected number of results, JVM settings, is that the only
> collection on cluster etc.
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 18 Dec 2017, at 15:07, Susheel Kumar <su...@gmail.com> wrote:
> >
> > Hello,
> >
> > I was testing Solr to see if a query which would cause OOM and would
> limit
> > the OOM issue to only the replica set's which gets hit first.
> >
> > But the behavior I see that after all set of first replica's went down
> due
> > to OOM (gone on cloud view) other replica's starts also getting down.
> Total
> > 6 shards I have with each shard having 2 replica's and on separate
> machines
> >
> > The expected behavior is that all shards replica which gets hit first
> > should go down due to OOM and then other replica's should survive and
> > provide High Availability.
> >
> > The setup I am testing with is Solr 6.0 and wondering if this is would
> > remain same with 6.6 or there has been some known improvements made to
> > avoid spreading OOM to second/third set of replica's and causing whole
> > cluster to down.
> >
> > Any info on this is appreciated.
> >
> > Thanks,
> > Susheel
>
>

Re: OOM spreads to other replica's/HA when OOM

Posted by Emir Arnautović <em...@sematext.com>.
Hi Susheel,
The fact that only the node that received the query OOMs tells us that it is about merging the results from all shards and producing the final result. It is expected that repeating the same query on some other node will result in similar behaviour - it just means that Solr does not have enough memory to execute this heavy query.
Can you share more details on your test: the size of the collection, the type of query, the expected number of results, the JVM settings, whether that is the only collection on the cluster, etc.?

Thanks,
Emir 
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Dec 2017, at 15:07, Susheel Kumar <su...@gmail.com> wrote:
> 
> Hello,
> 
> I was testing Solr to see if a query which would cause OOM and would limit
> the OOM issue to only the replica set's which gets hit first.
> 
> But the behavior I see that after all set of first replica's went down due
> to OOM (gone on cloud view) other replica's starts also getting down. Total
> 6 shards I have with each shard having 2 replica's and on separate machines
> 
> The expected behavior is that all shards replica which gets hit first
> should go down due to OOM and then other replica's should survive and
> provide High Availability.
> 
> The setup I am testing with is Solr 6.0 and wondering if this is would
> remain same with 6.6 or there has been some known improvements made to
> avoid spreading OOM to second/third set of replica's and causing whole
> cluster to down.
> 
> Any info on this is appreciated.
> 
> Thanks,
> Susheel