Posted to solr-user@lucene.apache.org by Jihwan Kim <ji...@gmail.com> on 2016/10/20 14:44:31 UTC

Memory Issue with SnapPuller

Hi,
We are using Solr 4.10.4 and experiencing an out-of-memory exception.  It
seems the problem is caused by the following code and scenario.

This is the last part of the fetchLatestIndex method in SnapPuller.java:

        // we must reload the core after we open the IW back up
        if (reloadCore) {
          reloadCore();
        }

        if (successfulInstall) {
          if (isFullCopyNeeded) {
            // let the system know we are changing dir's and the old one
            // may be closed
            if (indexDir != null) {
              LOG.info("removing old index directory " + indexDir);
              core.getDirectoryFactory().doneWithDirectory(indexDir);
              core.getDirectoryFactory().remove(indexDir);
            }
          }
          if (isFullCopyNeeded) {
            solrCore.getUpdateHandler().newIndexWriter(isFullCopyNeeded);
          }

          openNewSearcherAndUpdateCommitPoint(isFullCopyNeeded);
        }

Inside reloadCore, Solr creates a new core, registers it, and tries to close
the current/old core.  Even when closing the old core proceeds normally, the
SnapPuller thread throws an exception: "SnapPull failed
:org.apache.solr.common.SolrException: Index fetch failed Caused by
java.lang.RuntimeException: Interrupted while waiting for core reload to
finish Caused by: java.lang.InterruptedException."

Despite this exception, the process seems OK: it merely terminates the
SnapPuller thread, and the other threads that handle the close complete
normally.

*Now, the problem arises when the close() method called during reloadCore
doesn't actually close the core.*
This is the beginning of the close() method:

    public void close() {
        int count = refCount.decrementAndGet();
        // close is called often, and only actually closes if nothing is using it
        if (count > 0) return;
        if (count < 0) {
            log.error("Too many close [count:{}] on {}. Please report this "
                + "exception to solr-user@lucene.apache.org", count, this);
            assert false : "Too many closes on SolrCore";
            return;
        }
        log.info(logid + " CLOSING SolrCore " + this);

When an HTTP request is executing, refCount is greater than 1.  So when the
old core is being closed during the core reload, the if (count > 0) condition
simply returns from the method without closing anything.
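
To make the pattern concrete, here is a minimal, self-contained sketch of
the same reference-counting idea (a toy class, not Solr's actual SolrCore):
the real teardown only runs when the last holder releases its reference, so
an in-flight request silently turns the reload's close() into a no-op.

import java.util.concurrent.atomic.AtomicInteger;

// Toy class for illustration only, NOT Solr's SolrCore.
class RefCountedCore implements AutoCloseable {
  private final AtomicInteger refCount = new AtomicInteger(1);

  // e.g. an incoming HTTP request taking a reference to the core
  void incref() { refCount.incrementAndGet(); }

  @Override
  public void close() {
    int count = refCount.decrementAndGet();
    if (count > 0) return;          // someone still uses the core: do nothing
    if (count < 0) throw new IllegalStateException("too many closes");
    destroy();                      // the real teardown runs exactly once
  }

  private void destroy() { System.out.println("really closed"); }

  public static void main(String[] args) {
    RefCountedCore oldCore = new RefCountedCore();
    oldCore.incref();  // an HTTP request is still executing against the core
    oldCore.close();   // the reload's close(): count drops to 1, teardown skipped
    oldCore.close();   // the request finishes: count hits 0, destroy() runs now
  }
}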

Then the fetchLatestIndex method in SnapPuller continues to the next code
and executes "openNewSearcherAndUpdateCommitPoint".  If you look at this
method, it tries to open a new searcher on the solrCore that was captured
in the SnapPuller constructor, and I believe this still points to the old
core.  At certain timings, this method also throws:
SnapPuller  - java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at ....SnapPuller.openNewSearcherAndUpdateCommitPoint(SnapPuller.java:680)
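
For reference, this stack trace has exactly the shape of a thread that is
interrupted while blocked in FutureTask.get().  A small standalone sketch
(the thread names here are illustrative only, not Solr's):

import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class InterruptedGetDemo {
  public static void main(String[] args) throws Exception {
    FutureTask<String> warmup = new FutureTask<>(() -> {
      Thread.sleep(60_000);       // stands in for a slow newSearcher warm-up
      return "searcher ready";
    });
    Thread warmer = new Thread(warmup, "searcherExecutor");
    warmer.setDaemon(true);       // let the demo JVM exit while it still sleeps
    warmer.start();

    Thread puller = new Thread(() -> {
      try {
        warmup.get();             // blocks, like openNewSearcherAndUpdateCommitPoint
      } catch (InterruptedException e) {
        System.out.println("interrupted while waiting: " + e);
      } catch (ExecutionException e) {
        System.out.println("warm-up failed: " + e.getCause());
      }
    }, "snapPuller");
    puller.start();

    Thread.sleep(500);            // let the puller block in get()
    puller.interrupt();           // e.g. the old core's shutdown interrupting it
    puller.join();
  }
}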

After this exception, things start to go bad.

*In summary, I have two questions.*
1. Can you confirm this memory / thread issue?
2. When the core reload happens successfully (whether or not it throws the
exception), does Solr still need to call the
openNewSearcherAndUpdateCommitPoint method?

Thanks.

Re: Memory Issue with SnapPuller

Posted by Jihwan Kim <ji...@gmail.com>.
Sorry, I clicked the wrong button.

A little more about "At certain timings, this method also throws":
SnapPuller  - java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at ....SnapPuller.openNewSearcherAndUpdateCommitPoint(SnapPuller.java:680)

Here is the scenario (with less confidence).
The old core didn't complete the close() method during reloadCore because
of its refCount.  SnapPuller then executed the
openNewSearcherAndUpdateCommitPoint method.  Now an HTTP request, for
example, finished its processing and called the SolrCore close() method;
refCount is 0, and all the other processing in the close() method of the
SolrCore runs.
In this case, the InterruptedException can be thrown in
openNewSearcherAndUpdateCommitPoint.  After that, I noticed one thread
executing the newSearcher process hangs and CPU usage remains high.
We are also using a large external field file.
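
To illustrate the ordering I suspect, here is a speculative, self-contained
sketch (my toy code, not Solr's classes; the step comments map to the
scenario above):

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Speculative sketch of the suspected ordering:
// 1. the reload calls close() on the old core -> no-op, a request holds a ref
// 2. SnapPuller submits a newSearcher warm-up against the old core
// 3. the request finishes and calls close() -> refCount hits 0 and the real
//    teardown runs while the warm-up is still using the core
public class SuspectedRace {
  static final AtomicInteger refCount = new AtomicInteger(2); // core itself + one request

  static void close() {
    if (refCount.decrementAndGet() > 0) return;   // still referenced: no-op
    System.out.println("teardown of the old core starts");
  }

  public static void main(String[] args) throws ExecutionException, InterruptedException {
    ExecutorService searcherExecutor = Executors.newSingleThreadExecutor();
    close();                                  // step 1: reload's close() is a no-op
    Future<?> warmup = searcherExecutor.submit(() -> {
      try { Thread.sleep(200); } catch (InterruptedException ignored) { }
      System.out.println("opening a searcher on the old core");  // core may already be gone
    });
    close();                                  // step 3: request's close() triggers teardown
    warmup.get();                             // the warm-up finishes after teardown began
    searcherExecutor.shutdown();
  }
}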

Re: Memory Issue with SnapPuller

Posted by Jihwan Kim <ji...@gmail.com>.
A little more about "At certain timings, this method also throws":
SnapPuller  - java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at ....SnapPuller.openNewSearcherAndUpdateCommitPoint(SnapPuller.java:680)

This is a less confident scenario.
The old core didn't complete the close() method during reloadCore.
SnapPuller then executed the openNewSearcherAndUpdateCommitPoint method.
Now an HTTP request, for example, finished its processing and called the
SolrCore close() method; refCount is 0, and all the other processing in the
close() method runs.


Re: Memory Issue with SnapPuller

Posted by Jihwan Kim <ji...@gmail.com>.
Here is also a screenshot from jvisualvm.
The exception occurred at 2:55 PM and at 3:40 PM, and the OOME occurred at
3:41 PM.
SnapPuller  - java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at ....SnapPuller.openNewSearcherAndUpdateCommitPoint(SnapPuller.java:680)

Re: Memory Issue with SnapPuller

Posted by Jihwan Kim <ji...@gmail.com>.
Good points.
I am able to reproduce this with a periodic snap puller and only one HTTP
request.  When I load Solr on Tomcat, the initial memory usage is between
600 MB and 800 MB.  I first used a 1.5 GB heap and then increased it to
3.5 GB.  (When I said 'triple', I meant relative to the initial memory
consumption.)

OK... Shall we focus on my second question: when the core reload happens
successfully (whether or not it throws the exception), does Solr need to
call the openNewSearcherAndUpdateCommitPoint method?  I think this method
tries to open a new searcher on the old SolrCore.

Thank you!

Re: Memory Issue with SnapPuller

Posted by Erick Erickson <er...@gmail.com>.
You say you tripled the memory. Up to what? Tripling from 500M to 1.5G
isn't likely enough; tripling from 6G to 18G is something else
again....

You can take a look through any of the memory profilers and try to
catch the objects (and where they're being allocated). The second option is
to look at the stack trace (presuming you don't have an OOM-killer
script running) and perhaps triangulate that way.
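
For the first approach, on a HotSpot JVM you can also trigger a heap dump
programmatically and open it in a profiler such as VisualVM or Eclipse MAT
(a sketch of my own, not a Solr utility; jmap -dump:format=b,file=heap.hprof
<pid> does the same from outside the process):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Illustration only: dump the current JVM's heap to a file a profiler can
// open to see which objects dominate and where they are allocated.
public class HeapDumper {
  public static void main(String[] args) throws Exception {
    HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
        ManagementFactory.getPlatformMBeanServer(),
        "com.sun.management:type=HotSpotDiagnostic",
        HotSpotDiagnosticMXBean.class);
    diag.dumpHeap("heap.hprof", true);   // true = only live (reachable) objects
    System.out.println("wrote heap.hprof");
  }
}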

Best,
Erick

Re: Memory Issue with SnapPuller

Posted by Jihwan Kim <ji...@gmail.com>.
Thank you, Shawn. I understand the two options.
After my own testing with a smaller heap, I increased my heap size more
than threefold, but the OOME happens again in my test cases under a
controlled thread process. The increased heap size just delayed the OOME.

Can you provide feedback on my second question: when the core reload
happens successfully (whether or not it throws the exception), does Solr
need to call the openNewSearcherAndUpdateCommitPoint method?

As I described in my previous email, a thread created by the
openNewSearcherAndUpdateCommitPoint method hangs and causes high CPU usage
and slow response times.  The attached image shows the hung thread.
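
In case it helps reproduce this, here is a sketch of how the stack of the
suspected thread can be captured from inside the JVM (I am assuming the
searcher thread's name contains "searcherExecutor", which may differ in
your setup; jstack <pid> against the Tomcat process gives the same
information from outside):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: print the stacks of threads whose names look like Solr's searcher
// executor.  Run inside the same JVM as Solr; as a standalone main it only
// inspects its own (demo) JVM.
public class HungThreadDump {
  public static void main(String[] args) {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
      if (info.getThreadName().contains("searcherExecutor")) {
        System.out.print(info);   // name, state, owned locks, stack frames
      }
    }
  }
}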

Re: Memory Issue with SnapPuller

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/20/2016 8:44 AM, Jihwan Kim wrote:
> We are using Solr 4.10.4 and experiencing an out-of-memory exception.  It
> seems the problem is caused by the following code and scenario.

When you get an OutOfMemoryError exception that tells you there's not
enough heap space, the place where the exception happens is frequently
unrelated to the actual source of the problem.  Also, unless the
programmer engages in extraordinary effort, encountering OOME will cause
program behavior to become completely unpredictable.  Most of Solr has
*NOT* had the benefit of extraordinary effort to handle OOME gracefully.

Before continuing with troubleshooting of SnapPuller, you're going to
need to fix the OOME error.  When you run out of memory, that is likely
to be the CAUSE of any errors you're seeing, not a symptom.

There are exactly two ways to deal with OOME:  Increase the max heap, or
take steps to reduce the amount of heap required.  Increasing the heap
is the easiest option, and typically the first step.  Sometimes it's the
ONLY option.
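
A quick sanity check before concluding the heap really must grow: confirm
what max heap the JVM actually received, since an -Xmx change made in the
wrong startup script is a common reason a "bigger" heap changes nothing
(minimal illustration, not Solr code):

// Print the max heap this JVM was actually given.
public class MaxHeapCheck {
  public static void main(String[] args) {
    long max = Runtime.getRuntime().maxMemory();
    System.out.printf("max heap = %.1f MiB%n", max / (1024.0 * 1024.0));
  }
}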

https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

Thanks,
Shawn