Posted to solr-user@lucene.apache.org by Kojo <rb...@gmail.com> on 2019/08/12 11:47:27 UTC

Solr cloud questions

Hi,
I am using Solr cloud on this configuration:

2 boxes (one Solr in each box)
4 instances per box

At this moment I have an active collection with about 300,000 docs. The
other collections are not being queried. The active collection is
configured:
- shards: 16
- replication factor: 2

These two Solrs (Solr1 and Solr2) use ZooKeeper (one box, one instance,
no ZooKeeper cluster)

My application points to Solr1, and everything works fine until suddenly
one instance of this Solr1 dies. This instance is on port 8983, the "main"
instance. I thought it could be related to memory usage, but we increased
RAM and JVM memory and it still dies.
Solr1, the one which dies, is the destination where I point my web
application.

Here I have two questions that I hope you can help me with:

1. Which log can I look at to debug this issue?
2. After this instance dies, the Solr cloud does not answer my web
application. Is this correct? I thought that the replicas should answer if
one shard, instance, or box goes down.

Regards,
Koji

Re: Solr cloud questions

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/15/2019 8:14 AM, Kojo wrote:
> I am starting to think that my setup has more than one problem.
> As I said before, I am not balancing my load across the Solr nodes, and I
> have eight nodes. All of my web application requests go to one Solr node,
> the only one that dies. If I distribute the load across the other nodes,
> is it possible that these problems would go away?
> 
> Even if I downsize the Solr cloud setup to 2 boxes with 2 nodes each and
> fewer shards than the 16 I have now, I would like to know your opinion
> about the question above.

Based on those GC logs, we have 58 hours of good steady operation, 
followed by something bad.  Something happened in those few minutes that 
*didn't* happen in the previous 58 hours.

You could try increasing the heap beyond 6GB, but depending on what went 
wrong, that might not help.  And as Erick was hinting at, large heaps 
can create their own problems.

The better option is to figure out what's happening when it all goes bad 
and keep that from happening.  Load balancing might help, or it might 
cause whatever's happening on the one node to happen to all your nodes.

Thanks,
Shawn

Re: Solr cloud questions

Posted by Kojo <rb...@gmail.com>.
Erick,
I am using Python, so I think SolrJ is not an option. I wrote my own
libraries to connect to Solr and interpret Solr data.

I will try to load balance via the Apache server that is in front of Solr
before I change my setup; I think it will be simpler. I was not aware of
the single point of failure in my Solr Cloud setup when I set up my
infrastructure.

Thank you so much for your help,
Koji
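
For reference, here is a minimal sketch of the kind of Apache httpd front
end Kojo describes, using mod_proxy_balancer (module setup, hostnames, and
ports other than 8983 are assumptions, not details from the thread):

--------------
# httpd.conf fragment; needs mod_proxy, mod_proxy_http,
# mod_proxy_balancer and an lbmethod module loaded.
<Proxy "balancer://solrcluster">
    # One BalancerMember per Solr node (hosts/ports are hypothetical).
    BalancerMember "http://solr1:8983/solr"
    BalancerMember "http://solr1:8984/solr"
    BalancerMember "http://solr2:8983/solr"
    BalancerMember "http://solr2:8984/solr"
</Proxy>

# Route all /solr traffic through the balancer instead of a single node.
ProxyPass        "/solr" "balancer://solrcluster"
ProxyPassReverse "/solr" "balancer://solrcluster"
--------------

With something like this in place, the web application keeps pointing at
one URL while requests are spread over all nodes, and a member that stops
responding is skipped until it recovers.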





On Thu, Aug 15, 2019 at 14:11, Erick Erickson <er...@gmail.com>
wrote:

> OK, if you’re sending HTTP requests to a single node, that’s
> something of an anti-pattern unless it’s a load balancer that
> sends requests to random nodes in the cluster. Do note that
> even if you do send all http requests to one node, the top-level
> request will be forwarded to other nodes in the cluster.
>
> But if your single node dies, then indeed there’s no way for Solr
> to get the request to the other nodes.
>
> If you use SolrJ, in particular CloudSolrClient, it’s ZooKeeper-aware
> and will both avoid dead nodes _and_ distribute the top-level
> queries to all the Solr nodes. It’ll also be informed when a dead
> node comes back and will put it back into the rotation.
>
> Best,
> Erick
>
> > On Aug 15, 2019, at 10:14 AM, Kojo <rb...@gmail.com> wrote:
> >
> > Erick,
> > I am starting to think that my setup has more than one problem.
> > As I said before, I am not balancing my load across the Solr nodes, and
> > I have eight nodes. All of my web application requests go to one Solr
> > node, the only one that dies. If I distribute the load across the other
> > nodes, is it possible that these problems would go away?
> >
> > Even if I downsize the Solr cloud setup to 2 boxes with 2 nodes each and
> > fewer shards than the 16 I have now, I would like to know your opinion
> > about the question above.
> >
> > Thank you,
> > Koji
> >
> >
> >
> >
> > On Wed, Aug 14, 2019 at 14:15, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> >> Kojo:
> >>
> >> On the surface, this is a reasonable configuration. Note that you may
> >> still want to decrease the Java heap, but only if you have enough “head
> >> room” for memory spikes.
> >>
> >> How do you know if you have “head room”? Unfortunately the only good
> >> answer is “you have to test”. You can look at the GC logs to see what
> >> your maximum heap requirements are, then add “some extra”.
> >>
> >> Note that there’s a balance here. Let’s say you can run successfully
> >> with X heap, so you allocate X + 0.1X to the heap. You can wind up
> >> spending a large amount of time in garbage collection. I.e. GC kicks in
> >> and recovers _just_ enough memory to continue for a very short while,
> >> then goes into another GC cycle. You don’t hit OOMs, but your system is
> >> slow.
> >>
> >> OTOH, let’s say you need X and allocate 3X. Garbage will accumulate and
> >> full GCs are rarer, but when they occur they take longer.
> >>
> >> And the G1GC collector is the current preference.
> >>
> >> As I said, testing is really the only way to determine what the magic
> >> number is.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Aug 14, 2019, at 9:20 AM, Kojo <rb...@gmail.com> wrote:
> >>>
> >>> Shawn,
> >>>
> >>> Only my web application accesses this Solr. At a first look at the
> >>> HTTP server logs I didn't find anything different. Sometimes a very
> >>> big crawler accesses my servers; that was my first bet.
> >>>
> >>> No scheduled crons were running at this time either.
> >>>
> >>> I think that I will reconfigure my boxes with two Solr nodes each
> >>> instead of four and increase the heap to 16GB. This box only runs Solr
> >>> and has 64GB. Each Solr will use 16GB and the box will still have 32GB
> >>> for the OS. What do you think?
> >>>
> >>> This is a production server, so I will plan to migrate.
> >>>
> >>> Regards,
> >>> Koji
> >>>
> >>>
> >>> On Tue, Aug 13, 2019 at 12:58, Shawn Heisey <ap...@elyograg.org>
> >>> wrote:
> >>>
> >>>> On 8/13/2019 9:28 AM, Kojo wrote:
> >>>>> Here are the last two gc logs:
> >>>>>
> >>>>> https://send.firefox.com/download/6cc902670aa6f7dd/#Ee568G9vUtyK5zr-nAJoMQ
> >>>>
> >>>> Thank you for that.
> >>>>
> >>>> The 20MB gc log actually shows a pretty healthy system.
> >>>> That log covers 58 hours of runtime, and everything looks very good
> >>>> to me.
> >>>>
> >>>> https://www.dropbox.com/s/yu1pyve1bu9maun/gc-analysis-kojo.png?dl=0
> >>>>
> >>>> But the small log shows a different story.  That log only covers a
> >>>> little more than four minutes.
> >>>>
> >>>> https://www.dropbox.com/s/vkxfoihh12brbnr/gc-analysis-kojo2.png?dl=0
> >>>>
> >>>> What happened at approximately 10:55:15 PM on the day that the smaller
> >>>> log was produced?  Whatever happened caused Solr's heap usage to
> >>>> skyrocket and require more than 6GB.
> >>>>
> >>>> Thanks,
> >>>> Shawn
> >>>>
> >>
> >>
>
>

Re: Solr cloud questions

Posted by Erick Erickson <er...@gmail.com>.
OK, if you’re sending HTTP requests to a single node, that’s
something of an anti-pattern unless it’s a load balancer that
sends requests to random nodes in the cluster. Do note that
even if you do send all http requests to one node, the top-level
request will be forwarded to other nodes in the cluster.

But if your single node dies, then indeed there’s no way for Solr
to get the request to the other nodes.

If you use SolrJ, in particular CloudSolrClient, it’s ZooKeeper-aware
and will both avoid dead nodes _and_ distribute the top-level
queries to all the Solr nodes. It’ll also be informed when a dead
node comes back and will put it back into the rotation.

Best,
Erick
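
To make the CloudSolrClient behavior Erick describes concrete, here is a
minimal SolrJ sketch (the ZooKeeper address and collection name are
placeholders; this assumes the SolrJ 6.x builder API):

--------------
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQuery {
    public static void main(String[] args) throws Exception {
        // The client reads cluster state from ZooKeeper, so it can skip
        // dead nodes and spread top-level queries across live ones.
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zkhost:2181")        // placeholder ZK address
                .build()) {
            client.setDefaultCollection("mycollection"); // placeholder
            QueryResponse rsp = client.query(new SolrQuery("*:*"));
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}
--------------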

> On Aug 15, 2019, at 10:14 AM, Kojo <rb...@gmail.com> wrote:
> 
> Erick,
> I am starting to think that my setup has more than one problem.
> As I said before, I am not balancing my load across the Solr nodes, and I
> have eight nodes. All of my web application requests go to one Solr node,
> the only one that dies. If I distribute the load across the other nodes,
> is it possible that these problems would go away?
> 
> Even if I downsize the Solr cloud setup to 2 boxes with 2 nodes each and
> fewer shards than the 16 I have now, I would like to know your opinion
> about the question above.
> 
> Thank you,
> Koji
> 
> 
> 
> 
> On Wed, Aug 14, 2019 at 14:15, Erick Erickson <er...@gmail.com>
> wrote:
> 
>> Kojo:
>> 
>> On the surface, this is a reasonable configuration. Note that you may
>> still want to decrease the Java heap, but only if you have enough “head
>> room” for memory spikes.
>> 
>> How do you know if you have “head room”? Unfortunately the only good
>> answer is “you have to test”. You can look at the GC logs to see what your
>> maximum heap requirements are, then add “some extra”.
>> 
>> Note that there’s a balance here. Let’s say you can run successfully with
>> X heap, so you allocate X + 0.1X to the heap. You can wind up spending a
>> large amount of time in garbage collection. I.e. GC kicks in and recovers
>> _just_ enough memory to continue for a very short while, then goes into
>> another GC cycle. You don’t hit OOMs, but your system is slow.
>> 
>> OTOH, let’s say you need X and allocate 3X. Garbage will accumulate and
>> full GCs are rarer, but when they occur they take longer.
>> 
>> And the G1GC collector is the current preference.
>> 
>> As I said, testing is really the only way to determine what the magic
>> number is.
>> 
>> Best,
>> Erick
>> 
>>> On Aug 14, 2019, at 9:20 AM, Kojo <rb...@gmail.com> wrote:
>>> 
>>> Shawn,
>>> 
>>> Only my web application accesses this Solr. At a first look at the HTTP
>>> server logs I didn't find anything different. Sometimes a very big
>>> crawler accesses my servers; that was my first bet.
>>>
>>> No scheduled crons were running at this time either.
>>>
>>> I think that I will reconfigure my boxes with two Solr nodes each
>>> instead of four and increase the heap to 16GB. This box only runs Solr
>>> and has 64GB. Each Solr will use 16GB and the box will still have 32GB
>>> for the OS. What do you think?
>>> 
>>> This is a production server, so I will plan to migrate.
>>> 
>>> Regards,
>>> Koji
>>> 
>>> 
>>> On Tue, Aug 13, 2019 at 12:58, Shawn Heisey <ap...@elyograg.org>
>>> wrote:
>>> 
>>>> On 8/13/2019 9:28 AM, Kojo wrote:
>>>>> Here are the last two gc logs:
>>>>>
>>>>> https://send.firefox.com/download/6cc902670aa6f7dd/#Ee568G9vUtyK5zr-nAJoMQ
>>>>
>>>> Thank you for that.
>>>> 
>>>> The 20MB gc log actually shows a pretty healthy system.
>>>> That log covers 58 hours of runtime, and everything looks very good
>>>> to me.
>>>> 
>>>> https://www.dropbox.com/s/yu1pyve1bu9maun/gc-analysis-kojo.png?dl=0
>>>> 
>>>> But the small log shows a different story.  That log only covers a
>>>> little more than four minutes.
>>>> 
>>>> https://www.dropbox.com/s/vkxfoihh12brbnr/gc-analysis-kojo2.png?dl=0
>>>> 
>>>> What happened at approximately 10:55:15 PM on the day that the smaller
>>>> log was produced?  Whatever happened caused Solr's heap usage to
>>>> skyrocket and require more than 6GB.
>>>> 
>>>> Thanks,
>>>> Shawn
>>>> 
>> 
>> 


Re: Solr cloud questions

Posted by Kojo <rb...@gmail.com>.
Erick,
I am starting to think that my setup has more than one problem.
As I said before, I am not balancing my load across the Solr nodes, and I
have eight nodes. All of my web application requests go to one Solr node,
the only one that dies. If I distribute the load across the other nodes,
is it possible that these problems would go away?

Even if I downsize the Solr cloud setup to 2 boxes with 2 nodes each and
fewer shards than the 16 I have now, I would like to know your opinion
about the question above.

Thank you,
Koji




On Wed, Aug 14, 2019 at 14:15, Erick Erickson <er...@gmail.com>
wrote:

> Kojo:
>
> On the surface, this is a reasonable configuration. Note that you may
> still want to decrease the Java heap, but only if you have enough “head
> room” for memory spikes.
>
> How do you know if you have “head room”? Unfortunately the only good
> answer is “you have to test”. You can look at the GC logs to see what your
> maximum heap requirements are, then add “some extra”.
>
> Note that there’s a balance here. Let’s say you can run successfully with
> X heap, so you allocate X + 0.1X to the heap. You can wind up spending a
> large amount of time in garbage collection. I.e. GC kicks in and recovers
> _just_ enough memory to continue for a very short while, then goes into
> another GC cycle. You don’t hit OOMs, but your system is slow.
>
> OTOH, let’s say you need X and allocate 3X. Garbage will accumulate and
> full GCs are rarer, but when they occur they take longer.
>
> And the G1GC collector is the current preference.
>
> As I said, testing is really the only way to determine what the magic
> number is.
>
> Best,
> Erick
>
> > On Aug 14, 2019, at 9:20 AM, Kojo <rb...@gmail.com> wrote:
> >
> > Shawn,
> >
> > Only my web application accesses this Solr. At a first look at the HTTP
> > server logs I didn't find anything different. Sometimes a very big
> > crawler accesses my servers; that was my first bet.
> >
> > No scheduled crons were running at this time either.
> >
> > I think that I will reconfigure my boxes with two Solr nodes each
> > instead of four and increase the heap to 16GB. This box only runs Solr
> > and has 64GB. Each Solr will use 16GB and the box will still have 32GB
> > for the OS. What do you think?
> >
> > This is a production server, so I will plan to migrate.
> >
> > Regards,
> > Koji
> >
> >
> > On Tue, Aug 13, 2019 at 12:58, Shawn Heisey <ap...@elyograg.org>
> > wrote:
> >
> >> On 8/13/2019 9:28 AM, Kojo wrote:
> >>> Here are the last two gc logs:
> >>>
> >>> https://send.firefox.com/download/6cc902670aa6f7dd/#Ee568G9vUtyK5zr-nAJoMQ
> >>
> >> Thank you for that.
> >>
> >> The 20MB gc log actually shows a pretty healthy system.
> >> That log covers 58 hours of runtime, and everything looks very good
> >> to me.
> >>
> >> https://www.dropbox.com/s/yu1pyve1bu9maun/gc-analysis-kojo.png?dl=0
> >>
> >> But the small log shows a different story.  That log only covers a
> >> little more than four minutes.
> >>
> >> https://www.dropbox.com/s/vkxfoihh12brbnr/gc-analysis-kojo2.png?dl=0
> >>
> >> What happened at approximately 10:55:15 PM on the day that the smaller
> >> log was produced?  Whatever happened caused Solr's heap usage to
> >> skyrocket and require more than 6GB.
> >>
> >> Thanks,
> >> Shawn
> >>
>
>

Re: Solr cloud questions

Posted by Erick Erickson <er...@gmail.com>.
Kojo:

On the surface, this is a reasonable configuration. Note that you may still want to decrease the Java heap, but only if you have enough “head room” for memory spikes.

How do you know if you have “head room”? Unfortunately the only good answer is “you have to test”. You can look at the GC logs to see what your maximum heap requirements are, then add “some extra”.

Note that there’s a balance here. Let’s say you can run successfully with X heap, so you allocate X + 0.1X to the heap. You can wind up spending a large amount of time in garbage collection. I.e. GC kicks in and recovers _just_ enough memory to continue for a very short while, then goes into another GC cycle. You don’t hit OOMs, but your system is slow.

OTOH, let’s say you need X and allocate 3X. Garbage will accumulate and full GCs are rarer, but when they occur they take longer.

And the G1GC collector is the current preference.

As I said, testing is really the only way to determine what the magic number is.

Best,
Erick
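
For what it's worth, the heap size and collector flags can both be set in
Solr's bin/solr.in.sh; a sketch along the lines Erick suggests (the exact
values are illustrative assumptions, not a recommendation from the thread):

--------------
# bin/solr.in.sh fragment (values are illustrative only)
SOLR_HEAP="16g"                  # the tested maximum plus "some extra"
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"  # replaces the default CMS flags
--------------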

> On Aug 14, 2019, at 9:20 AM, Kojo <rb...@gmail.com> wrote:
> 
> Shawn,
> 
> Only my web application accesses this Solr. At a first look at the HTTP
> server logs I didn't find anything different. Sometimes a very big crawler
> accesses my servers; that was my first bet.
> 
> No scheduled crons were running at this time either.
> 
> I think that I will reconfigure my boxes with two Solr nodes each instead
> of four and increase the heap to 16GB. This box only runs Solr and has
> 64GB. Each Solr will use 16GB and the box will still have 32GB for the OS.
> What do you think?
> 
> This is a production server, so I will plan to migrate.
> 
> Regards,
> Koji
> 
> 
> On Tue, Aug 13, 2019 at 12:58, Shawn Heisey <ap...@elyograg.org>
> wrote:
> 
>> On 8/13/2019 9:28 AM, Kojo wrote:
>>> Here are the last two gc logs:
>>>
>>> https://send.firefox.com/download/6cc902670aa6f7dd/#Ee568G9vUtyK5zr-nAJoMQ
>> 
>> Thank you for that.
>> 
>> The 20MB gc log actually shows a pretty healthy system.
>> That log covers 58 hours of runtime, and everything looks very good to me.
>> 
>> https://www.dropbox.com/s/yu1pyve1bu9maun/gc-analysis-kojo.png?dl=0
>> 
>> But the small log shows a different story.  That log only covers a
>> little more than four minutes.
>> 
>> https://www.dropbox.com/s/vkxfoihh12brbnr/gc-analysis-kojo2.png?dl=0
>> 
>> What happened at approximately 10:55:15 PM on the day that the smaller
>> log was produced?  Whatever happened caused Solr's heap usage to
>> skyrocket and require more than 6GB.
>> 
>> Thanks,
>> Shawn
>> 


Re: Solr cloud questions

Posted by Kojo <rb...@gmail.com>.
Ere,
thanks for the advice. I don't have this specific use case, but I am doing
some operations that I think could be risky, since this is the first time I
am using them.

There is a page that groups by one specific attribute of documents
distributed across shards. I am using composite IDs to allow grouping
correctly, but I don't know how this task performs. This page groups and
lists these attributes like "snippets", and it allows paging.

I am doing some graph queries too, using streaming. As far as I can
observe, these features are not causing the problem I described.

Thank you,
Koji
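
As background on the composite ID routing mentioned above: with the default
compositeId router, everything before a "!" in the document ID is hashed to
pick the shard, so documents sharing a prefix land on the same shard and
can be grouped there. A hypothetical SolrJ sketch (IDs, field names, and
addresses are made up):

--------------
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CompositeIdIndexing {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zkhost:2181").build()) {    // placeholder
            client.setDefaultCollection("mycollection"); // placeholder
            SolrInputDocument doc = new SolrInputDocument();
            // "projectA!" routes every projectA document to one shard.
            doc.addField("id", "projectA!doc-42");
            doc.addField("attribute_s", "projectA");
            client.add(doc);
            client.commit();
        }
    }
}
--------------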






On Fri, Aug 16, 2019 at 04:34, Ere Maijala <er...@helsinki.fi>
wrote:

> Does your web application, by any chance, allow deep paging or something
> like that which requires returning rows at the end of a large result
> set? Something like a query where you could have parameters like
> &rows=10&start=1000000 ? That can easily cause OOM with Solr when using
> a sharded index. It would typically require a large number of rows to be
> returned and combined from all shards just to get the few rows to return
> in the correct order.
>
> For the above example with 8 shards, Solr would have to fetch 1,000,010
> rows from each shard. That's over 8 million rows! Even if it's just
> identifiers, that's a lot of memory required for an operation that seems
> so simple on the surface.
>
> If this is the case, you'll need to prevent the web application from
> issuing such queries. This may mean something like supporting paging
> only among the first 10,000 results. A typical requirement may also be to
> be able to see the last results of a query, but this can be accomplished
> by allowing sorting in both ascending and descending order.
>
> Regards,
> Ere
>
> Kojo wrote on 14.8.2019 at 16.20:
> > Shawn,
> >
> > Only my web application accesses this Solr. At a first look at the HTTP
> > server logs I didn't find anything different. Sometimes a very big
> > crawler accesses my servers; that was my first bet.
> >
> > No scheduled crons were running at this time either.
> >
> > I think that I will reconfigure my boxes with two Solr nodes each
> > instead of four and increase the heap to 16GB. This box only runs Solr
> > and has 64GB. Each Solr will use 16GB and the box will still have 32GB
> > for the OS. What do you think?
> >
> > This is a production server, so I will plan to migrate.
> >
> > Regards,
> > Koji
> >
> >
> > On Tue, Aug 13, 2019 at 12:58, Shawn Heisey <ap...@elyograg.org>
> > wrote:
> >
> >> On 8/13/2019 9:28 AM, Kojo wrote:
> >>> Here are the last two gc logs:
> >>>
> >>> https://send.firefox.com/download/6cc902670aa6f7dd/#Ee568G9vUtyK5zr-nAJoMQ
> >>
> >>
> >> Thank you for that.
> >>
> >> The 20MB gc log actually shows a pretty healthy system.
> >> That log covers 58 hours of runtime, and everything looks very good
> >> to me.
> >>
> >> https://www.dropbox.com/s/yu1pyve1bu9maun/gc-analysis-kojo.png?dl=0
> >>
> >> But the small log shows a different story.  That log only covers a
> >> little more than four minutes.
> >>
> >> https://www.dropbox.com/s/vkxfoihh12brbnr/gc-analysis-kojo2.png?dl=0
> >>
> >> What happened at approximately 10:55:15 PM on the day that the smaller
> >> log was produced?  Whatever happened caused Solr's heap usage to
> >> skyrocket and require more than 6GB.
> >>
> >> Thanks,
> >> Shawn
> >>
> >
>
> --
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland
>

Re: Solr cloud questions

Posted by Ere Maijala <er...@helsinki.fi>.
Does your web application, by any chance, allow deep paging or something
like that which requires returning rows at the end of a large result
set? Something like a query where you could have parameters like
&rows=10&start=1000000 ? That can easily cause OOM with Solr when using
a sharded index. It would typically require a large number of rows to be
returned and combined from all shards just to get the few rows to return
in the correct order.

For the above example with 8 shards, Solr would have to fetch 1,000,010
rows from each shard. That's over 8 million rows! Even if it's just
identifiers, that's a lot of memory required for an operation that seems
so simple on the surface.

If this is the case, you'll need to prevent the web application from
issuing such queries. This may mean something like supporting paging
only among the first 10,000 results. A typical requirement may also be to
be able to see the last results of a query, but this can be accomplished
by allowing sorting in both ascending and descending order.

Regards,
Ere
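
One common way around the deep-paging cost Ere describes is Solr's
cursorMark support, which walks a result set without the start=N blow-up.
A minimal SolrJ sketch (the ZooKeeper address, collection, and uniqueKey
field are placeholders):

--------------
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPaging {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zkhost:2181").build()) {
            client.setDefaultCollection("mycollection");
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(100);
            // cursorMark needs a deterministic sort ending on uniqueKey.
            q.setSort(SolrQuery.SortClause.asc("id"));
            String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = client.query(q);
                // ... process rsp.getResults() here ...
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) break; // cursor stopped advancing
                cursor = next;
            }
        }
    }
}
--------------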

Kojo wrote on 14.8.2019 at 16.20:
> Shawn,
> 
> Only my web application accesses this Solr. At a first look at the HTTP
> server logs I didn't find anything different. Sometimes a very big crawler
> accesses my servers; that was my first bet.
> 
> No scheduled crons were running at this time either.
> 
> I think that I will reconfigure my boxes with two Solr nodes each instead
> of four and increase the heap to 16GB. This box only runs Solr and has
> 64GB. Each Solr will use 16GB and the box will still have 32GB for the OS.
> What do you think?
> 
> This is a production server, so I will plan to migrate.
> 
> Regards,
> Koji
> 
> 
> On Tue, Aug 13, 2019 at 12:58, Shawn Heisey <ap...@elyograg.org>
> wrote:
> 
>> On 8/13/2019 9:28 AM, Kojo wrote:
>>> Here are the last two gc logs:
>>>
>>> https://send.firefox.com/download/6cc902670aa6f7dd/#Ee568G9vUtyK5zr-nAJoMQ
>>
>> Thank you for that.
>>
>> The 20MB gc log actually shows a pretty healthy system.
>> That log covers 58 hours of runtime, and everything looks very good to me.
>>
>> https://www.dropbox.com/s/yu1pyve1bu9maun/gc-analysis-kojo.png?dl=0
>>
>> But the small log shows a different story.  That log only covers a
>> little more than four minutes.
>>
>> https://www.dropbox.com/s/vkxfoihh12brbnr/gc-analysis-kojo2.png?dl=0
>>
>> What happened at approximately 10:55:15 PM on the day that the smaller
>> log was produced?  Whatever happened caused Solr's heap usage to
>> skyrocket and require more than 6GB.
>>
>> Thanks,
>> Shawn
>>
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland

Re: Solr cloud questions

Posted by Kojo <rb...@gmail.com>.
Shawn,

Only my web application accesses this Solr. At a first look at the HTTP
server logs I didn't find anything different. Sometimes a very big crawler
accesses my servers; that was my first bet.

No scheduled crons were running at this time either.

I think that I will reconfigure my boxes with two Solr nodes each instead
of four and increase the heap to 16GB. This box only runs Solr and has
64GB. Each Solr will use 16GB and the box will still have 32GB for the OS.
What do you think?

This is a production server, so I will plan to migrate.

Regards,
Koji


On Tue, Aug 13, 2019 at 12:58, Shawn Heisey <ap...@elyograg.org>
wrote:

> On 8/13/2019 9:28 AM, Kojo wrote:
> > Here are the last two gc logs:
> >
> > https://send.firefox.com/download/6cc902670aa6f7dd/#Ee568G9vUtyK5zr-nAJoMQ
>
> Thank you for that.
>
> The 20MB gc log actually shows a pretty healthy system.
> That log covers 58 hours of runtime, and everything looks very good to me.
>
> https://www.dropbox.com/s/yu1pyve1bu9maun/gc-analysis-kojo.png?dl=0
>
> But the small log shows a different story.  That log only covers a
> little more than four minutes.
>
> https://www.dropbox.com/s/vkxfoihh12brbnr/gc-analysis-kojo2.png?dl=0
>
> What happened at approximately 10:55:15 PM on the day that the smaller
> log was produced?  Whatever happened caused Solr's heap usage to
> skyrocket and require more than 6GB.
>
> Thanks,
> Shawn
>

Re: Solr cloud questions

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/13/2019 9:28 AM, Kojo wrote:
> Here are the last two gc logs:
> 
> https://send.firefox.com/download/6cc902670aa6f7dd/#Ee568G9vUtyK5zr-nAJoMQ

Thank you for that.

The 20MB gc log actually shows a pretty healthy system.
That log covers 58 hours of runtime, and everything looks very good to me.

https://www.dropbox.com/s/yu1pyve1bu9maun/gc-analysis-kojo.png?dl=0

But the small log shows a different story.  That log only covers a 
little more than four minutes.

https://www.dropbox.com/s/vkxfoihh12brbnr/gc-analysis-kojo2.png?dl=0

What happened at approximately 10:55:15 PM on the day that the smaller 
log was produced?  Whatever happened caused Solr's heap usage to 
skyrocket and require more than 6GB.

Thanks,
Shawn

Re: Solr cloud questions

Posted by Kojo <rb...@gmail.com>.
Shawn,
Here are the last two gc logs:

https://send.firefox.com/download/6cc902670aa6f7dd/#Ee568G9vUtyK5zr-nAJoMQ


Thank you,
Koji


On Tue, Aug 13, 2019 at 09:33, Shawn Heisey <ap...@elyograg.org>
wrote:

> On 8/13/2019 6:19 AM, Kojo wrote:
> > --------------
> > tail -f  node1/logs/solr_oom_killer-8983-2019-08-11_22_57_56.log
> > Running OOM killer script for process 38788 for Solr on port 8983
> > Killed process 38788
> > --------------
>
> Based on what I can see, a 6GB heap is not big enough for the setup
> you've got.  There are two ways to deal with an OOME problem.  1)
> Increase the resource that was depleted.  2) Change the configuration so
> the program needs less of that resource.
>
>
> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-JavaHeap
>
> > tail -50   node1/logs/archived/solr_gc.log.4.current
>
> To be useful, we will need the entire GC log, not a 50 line subset.  In
> the subset, I can see that there was a full GC that did absolutely
> nothing -- no memory was freed.  This is evidence that your heap is too
> small.  You will need to use a file sharing site and provide a URL for
> the entire GC log - email attachments rarely make it to the list.  The
> bigger the log is, the better idea we can get about what heap size you
> need.
>
> Thanks,
> Shawn
>

Re: Solr cloud questions

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/13/2019 6:19 AM, Kojo wrote:
> --------------
> tail -f  node1/logs/solr_oom_killer-8983-2019-08-11_22_57_56.log
> Running OOM killer script for process 38788 for Solr on port 8983
> Killed process 38788
> --------------

Based on what I can see, a 6GB heap is not big enough for the setup 
you've got.  There are two ways to deal with an OOME problem.  1) 
Increase the resource that was depleted.  2) Change the configuration so 
the program needs less of that resource.

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-JavaHeap

> tail -50   node1/logs/archived/solr_gc.log.4.current

To be useful, we will need the entire GC log, not a 50 line subset.  In 
the subset, I can see that there was a full GC that did absolutely 
nothing -- no memory was freed.  This is evidence that your heap is too 
small.  You will need to use a file sharing site and provide a URL for 
the entire GC log - email attachments rarely make it to the list.  The 
bigger the log is, the better idea we can get about what heap size you need.

Thanks,
Shawn

Re: Solr cloud questions

Posted by Kojo <rb...@gmail.com>.
Erick and Shawn,
thank you very much for the very useful information.

When I started to move from single Solr to cloud, I was planning to use the
cluster for very large collections.

But the collection that I mentioned will not grow that much, so I will
reduce the number of shards.


Thanks for the information about load balancing. I will set it up.



Shawn, below I share the information that I hope will clarify things.

Linux CentOS
Solr 6.6
64 GB RAM in each box
6 GB heap for each node

The last time the node died was on 2019-08-11. It happens a few times a
week.


--------------
tail -f  node1/logs/solr_oom_killer-8983-2019-08-11_22_57_56.log
Running OOM killer script for process 38788 for Solr on port 8983
Killed process 38788
--------------


--------------
ls -ltr  node1/logs/archived/
total 82032
-rw-rw-r-- 1 solr solr 20973030 Aug  4 18:31 solr_gc.log.0
-rw-rw-r-- 1 solr solr 20973415 Aug  6 21:05 solr_gc.log.1
-rw-rw-r-- 1 solr solr 20971714 Aug  9 12:01 solr_gc.log.2
-rw-rw-r-- 1 solr solr 20971720 Aug 11 22:53 solr_gc.log.3
-rw-rw-r-- 1 solr solr    77096 Aug 11 22:57 solr_gc.log.4.current
-rw-rw-r-- 1 solr solr      364 Aug 11 22:57 solr-8983-console.log
--------------


--------------
tail -50   node1/logs/archived/solr_gc.log.4.current
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved
1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved
1048576K
}
2019-08-11T22:57:39.231-0300: 802516.887: Total time for which application
threads were stopped: 12.5386815 seconds, Stopping threads took: 0.0001242
seconds
{Heap before GC invocations=34291 (full 252):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000,
0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K, 100% used [0x0000000640000000, 0x0000000680000000,
0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069ffffff8,
0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000,
0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718592K
[0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved
1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved
1048576K
2019-08-11T22:57:39.233-0300: 802516.889: [Full GC (Allocation Failure)
2019-08-11T22:57:39.233-0300: 802516.889: [CMS:
4718592K->4718591K(4718592K), 5.5779385 secs] 6029311K->6029311K(6029312K),
[Metaspace: 50496K->50496K(1097728K)], 5.5780863 secs] [Times: user=5.58
sys=0.00, real=5.58 secs]
Heap after GC invocations=34292 (full 253):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000,
0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K,  99% used [0x0000000640000000, 0x000000067fffff68,
0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff18,
0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000,
0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K
[0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved
1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved
1048576K
}
2019-08-11T22:57:44.812-0300: 802522.469: Total time for which application
threads were stopped: 5.5805500 seconds, Stopping threads took: 0.0001295
seconds
{Heap before GC invocations=34292 (full 253):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000,
0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K, 100% used [0x0000000640000000, 0x0000000680000000,
0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff98,
0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000,
0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K
[0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved
1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved
1048576K
2019-08-11T22:57:44.813-0300: 802522.470: [Full GC (Allocation Failure)
2019-08-11T22:57:44.813-0300: 802522.470: [CMS:
4718591K->4718591K(4718592K), 5.5944800 secs] 6029311K->6029311K(6029312K),
[Metaspace: 50496K->50496K(1097728K)], 5.5946363 secs] [Times: user=5.60
sys=0.00, real=5.59 secs]
Heap after GC invocations=34293 (full 254):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000,
0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K,  99% used [0x0000000640000000, 0x000000067fffffe8,
0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff98,
0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000,
0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K
[0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved
1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved
1048576K
}
{Heap before GC invocations=34293 (full 254):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000,
0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K,  99% used [0x0000000640000000, 0x000000067fffffe8,
0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff98,
0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000,
0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K
[0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved
1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved
1048576K
2019-08-11T22:57:50.408-0300: 802528.065: [Full GC (Allocation Failure)
2019-08-11T22:57:50.408-0300: 802528.065: [CMS:
4718591K->4718591K(4718592K), 5.5953203 secs] 6029311K->6029311K(6029312K),
[Metaspace: 50496K->50496K(1097728K)], 5.5954659 secs] [Times: user=5.60
sys=0.00, real=5.60 secs]
--------------

On Mon, Aug 12, 2019 at 13:26, Shawn Heisey <ap...@elyograg.org>
wrote:

> On 8/12/2019 5:47 AM, Kojo wrote:
> > I am using Solr cloud on this configuration:
> >
> > 2 boxes (one Solr in each box)
> > 4 instances per box
>
> Why are you running multiple instances on one server?  For most setups,
> this has too much overhead.  A single instance can handle many indexes.
> The only good reason I can think of to run multiple instances is when
> the amount of heap memory needed exceeds 31GB.  And even then, four
> instances seems excessive.  If you only have 300000 documents, there
> should be no reason for a super large heap.
>
> > At this moment I have an active collection with about 300,000 docs. The
> > other collections are not being queried. The active collection is
> > configured:
> > - shards: 16
> > - replication factor: 2
> >
> > These two Solrs (Solr1 and Solr2) use ZooKeeper (one box, one instance,
> > no ZooKeeper cluster)
> >
> > My application points to Solr1, and everything works fine until suddenly
> > one instance of this Solr1 dies. This instance is on port 8983, the
> > "main" instance. I thought it could be related to memory usage, but we
> > increased RAM and JVM memory and it still dies.
> > Solr1, the one which dies, is the destination where I point my web
> > application.
>
> You will have to check the logs.  If Solr is not running on Windows,
> then any OutOfMemoryError exception, which can be caused by things other
> than a memory shortage, will result in Solr terminating itself.  On
> Windows, that functionality does not yet exist, so it would have to be
> Java or the OS that kills it.
>
> > Here I have two questions that I hope you can help me with:
> >
> > 1. Which log can I look at to debug this issue?
>
> Assuming you're NOT on Windows, check to see if there is a logfile named
> solr_oom_killer-8983.log in the logs directory where solr.log lives.  If
> there is, then that means the oom killer script was executed, and that
> happens when there is an OutOfMemoryError thrown.  The solr.log file
> MIGHT contain the OOME exception which will tell you what system
> resource was depleted.  If it was not heap memory that was depleted,
> then increasing memory probably won't help.
>
> If you share the gc log that Solr writes, we can analyze this to see if
> it was heap memory that was depleted.
>
> > 2. After this instance dies, the Solr cloud does not answer my web
> > application. Is this correct? I thought that the replicas should answer
> > if one shard, instance, or box goes down.
>
> If a Solr instance dies, you can't make connections directly to it.
> Connections would need to go to another instance.  You need a load
> balancer to handle that automatically, or a cloud-aware client.  The
> only cloud-aware client that I am sure about is the one for Java -- it
> is named SolrJ, created by the Solr project and distributed with Solr.
> I think that a third party MIGHT have written a cloud-aware client for
> Python, but I am not sure about this.
>
> If you set up a load balancer, you will need to handle redundancy for that.
>
> Side note:  A fully redundant zookeeper install needs three servers.  Do
> not put a load balancer in front of zookeeper.  The ZK protocol handles
> redundancy itself and a load balancer will break that.
>
> Thanks.
> Shawn
>

Re: Solr cloud questions

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/12/2019 5:47 AM, Kojo wrote:
> I am using Solr cloud on this configuration:
> 
> 2 boxes (one Solr in each box)
> 4 instances per box

Why are you running multiple instances on one server?  For most setups, 
this has too much overhead.  A single instance can handle many indexes. 
The only good reason I can think of to run multiple instances is when 
the amount of heap memory needed exceeds 31GB.  And even then, four 
instances seems excessive.  If you only have 300000 documents, there 
should be no reason for a super large heap.

> At this moment I have an active collection with about 300,000 docs. The
> other collections are not being queried. The active collection is
> configured:
> - shards: 16
> - replication factor: 2
> 
> These two Solrs (Solr1 and Solr2) use ZooKeeper (one box, one instance,
> no ZooKeeper cluster)
> 
> My application points to Solr1, and everything works fine until suddenly
> one instance of this Solr1 dies. This instance is on port 8983, the "main"
> instance. I thought it could be related to memory usage, but we increased
> RAM and JVM memory and it still dies.
> Solr1, the one which dies, is the destination where I point my web
> application.

You will have to check the logs.  If Solr is not running on Windows, 
then any OutOfMemoryError exception, which can be caused by things other 
than a memory shortage, will result in Solr terminating itself.  On 
Windows, that functionality does not yet exist, so it would have to be 
Java or the OS that kills it.

> Here I have two questions that I hope you can help me with:
> 
> 1. Which log can I look at to debug this issue?

Assuming you're NOT on Windows, check to see if there is a logfile named 
solr_oom_killer-8983.log in the logs directory where solr.log lives.  If 
there is, then that means the oom killer script was executed, and that 
happens when there is an OutOfMemoryError thrown.  The solr.log file 
MIGHT contain the OOME exception which will tell you what system 
resource was depleted.  If it was not heap memory that was depleted, 
then increasing memory probably won't help.

If you share the gc log that Solr writes, we can analyze this to see if 
it was heap memory that was depleted.

> 2. After this instance dies, the Solr cloud does not answer my web
> application. Is this correct? I thought that the replicas should answer if
> one shard, instance, or box goes down.

If a Solr instance dies, you can't make connections directly to it. 
Connections would need to go to another instance.  You need a load 
balancer to handle that automatically, or a cloud-aware client.  The 
only cloud-aware client that I am sure about is the one for Java -- it 
is named SolrJ, created by the Solr project and distributed with Solr. 
I think that a third party MIGHT have written a cloud-aware client for 
Python, but I am not sure about this.

If you set up a load balancer, you will need to handle redundancy for that.

Side note:  A fully redundant zookeeper install needs three servers.  Do 
not put a load balancer in front of zookeeper.  The ZK protocol handles 
redundancy itself and a load balancer will break that.

Thanks.
Shawn
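
To illustrate the three-server ZooKeeper note above: a redundant ensemble
is configured with a zoo.cfg along these lines on each host (hostnames and
paths are placeholders):

--------------
# zoo.cfg, identical on all three ZooKeeper hosts
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper   # must contain a "myid" file (1, 2, or 3)
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
--------------

Solr is then pointed at all three, e.g. with
ZK_HOST="zk1:2181,zk2:2181,zk3:2181" in bin/solr.in.sh.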

Re: Solr cloud questions

Posted by Erick Erickson <er...@gmail.com>.
Kojo:

The solr logs should give you a much better idea of what the triggering event was.

Just increasing the heap doesn’t guarantee much; again, the Solr logs will report the OOM exception if it’s memory-related. You haven’t told us what your physical RAM is or how much you’re allocating to the heap; those would be helpful.

As far as Solr not answering, It Depends (tm). How are you querying Solr? If it’s just an HTTP request to the node that died, there’s no communication possible; the HTTP endpoint is down. If you’re using SolrJ or a load balancer in front, then the request should indeed get to a live Solr and you should get a reply.

I’ll add that from what you report, this system seems massively over-sharded. I generally start my testing with the assumption that I can fit 50,000,000 documents per shard on a decent-sized box. So unless this configuration is for massive planned growth, the number of shards you have is far in excess of what you need. This isn’t the root cause of your problem, but it doesn’t help either….

Best,
Erick

> On Aug 12, 2019, at 7:47 AM, Kojo <rb...@gmail.com> wrote:
> 
> Hi,
> I am using Solr cloud on this configuration:
> 
> 2 boxes (one Solr in each box)
> 4 instances per box
> 
> At this moment I have an active collection with about 300,000 docs. The
> other collections are not being queried. The active collection is
> configured:
> - shards: 16
> - replication factor: 2
> 
> These two Solrs (Solr1 and Solr2) use ZooKeeper (one box, one instance,
> no ZooKeeper cluster)
> 
> My application points to Solr1, and everything works fine until suddenly
> one instance of this Solr1 dies. This instance is on port 8983, the "main"
> instance. I thought it could be related to memory usage, but we increased
> RAM and JVM memory and it still dies.
> Solr1, the one which dies, is the destination where I point my web
> application.
> 
> Here I have two questions that I hope you can help me with:
> 
> 1. Which log can I look at to debug this issue?
> 2. After this instance dies, the Solr cloud does not answer my web
> application. Is this correct? I thought that the replicas should answer if
> one shard, instance, or box goes down.
> 
> Regards,
> Koji