Posted to solr-user@lucene.apache.org by Rallavagu <ra...@gmail.com> on 2015/10/03 01:12:35 UTC

Recovery Thread Blocked

Solr 4.6.1 on Tomcat 7, single-shard 4-node cloud with a 3-node ZooKeeper 
ensemble.

During updates, some nodes go to very high CPU and become unavailable. The 
thread dump shows the following thread blocked; about 870 threads are stuck 
like this, which explains the high CPU. Any clues on where to look?

"Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked, 
native_blocked, daemon
     -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
     at 
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
     at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
     at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
     at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
     at 
jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
     at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
     at 
org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
     at 
org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
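
All of the frames above are parked in monitorEnter on the same 
DefaultSolrCoreState lock object. As a rough illustration only (plain Java, 
not Solr code; the class name, thread count, and sleep duration below are 
made up), this is the shape of contention such a dump describes: one thread 
holds a monitor while doing slow work, and every other caller of a 
synchronized block on that monitor queues up behind it.

import java.util.concurrent.TimeUnit;

// Illustrative sketch, not Solr code: many threads calling a synchronized
// block on one shared lock object pile up in monitorEnter while a single
// holder does slow work inside it.
public class LockPileUpDemo {
    private static final Object recoveryLock = new Object();

    static void doRecovery() {
        synchronized (recoveryLock) { // waiters block right here
            sleepQuietly(60);         // stand-in for slow work done under the lock
        }
    }

    static void sleepQuietly(long seconds) {
        try {
            TimeUnit.SECONDS.sleep(seconds);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 870; i++) { // mirrors the ~870 threads in the dump
            new Thread(LockPileUpDemo::doRecovery, "Thread-" + i).start();
        }
        // A thread dump taken now shows one thread inside the synchronized
        // block and the rest blocked trying to get the same monitor.
    }
}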

Re: Recovery Thread Blocked

Posted by Rallavagu <ra...@gmail.com>.
It is a Java thread, though. Does it need an increase in OS-level thread limits?

On 10/6/15 6:21 PM, Mark Miller wrote:
> If it's a thread and you have plenty of RAM and the heap is fine, have you
> checked raising OS thread limits?
>
> - Mark

Re: Recovery Thread Blocked

Posted by Mark Miller <ma...@gmail.com>.
If it's a thread and you have plenty of RAM and the heap is fine, have you
checked raising OS thread limits?
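
One quick sanity check from the JVM side is how many threads the process has 
actually created; a minimal sketch using the standard management beans 
(illustrative only, not from this thread). The OS-side limits themselves are 
checked outside the JVM, for example with ulimit -u or 
/proc/sys/kernel/threads-max on Linux.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Illustrative sketch: report JVM thread counts to compare against the
// operating system's per-user/per-process thread limits.
public class ThreadCountCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("Live threads:  " + threads.getThreadCount());
        System.out.println("Peak threads:  " + threads.getPeakThreadCount());
        System.out.println("Total started: " + threads.getTotalStartedThreadCount());
    }
}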

- Mark

On Tue, Oct 6, 2015 at 4:54 PM Rallavagu <ra...@gmail.com> wrote:

> GC logging shows normal. The "OutOfMemoryError" appears to be pertaining
> to a thread but not to JVM.
-- 
- Mark
about.me/markrmiller

Re: Recovery Thread Blocked

Posted by Rallavagu <ra...@gmail.com>.
GC logging shows normal behavior. The "OutOfMemoryError" appears to pertain 
to a thread rather than to the JVM heap.
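
For what it's worth, that flavor of error usually surfaces as 
java.lang.OutOfMemoryError: unable to create new native thread, which means 
the operating system refused to create another thread; the Java heap can be 
nearly empty when it happens. A self-contained sketch that reproduces it 
(illustrative only, and not something to run on a production box):

// Illustrative sketch: keep starting parked daemon threads until the OS
// refuses another one. The resulting OutOfMemoryError reflects a native
// resource limit (per-process/per-user thread limit or stack space), not
// heap exhaustion.
public class NativeThreadOom {
    public static void main(String[] args) {
        long started = 0;
        try {
            while (true) {
                Thread t = new Thread(() -> {
                    try {
                        Thread.sleep(Long.MAX_VALUE); // park the thread forever
                    } catch (InterruptedException ignored) {
                    }
                });
                t.setDaemon(true);
                t.start();
                started++;
            }
        } catch (OutOfMemoryError e) {
            System.out.println("Failed after starting " + started + " threads: " + e);
            System.out.println("Free heap: " + Runtime.getRuntime().freeMemory() + " bytes");
        }
    }
}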

On 10/6/15 1:07 PM, Mark Miller wrote:
> That amount of RAM can easily be eaten up depending on your sorting,
> faceting, data.
>
> Do you have gc logging enabled? That should describe what is happening with
> the heap.
>
> - Mark

Re: Recovery Thread Blocked

Posted by Mark Miller <ma...@gmail.com>.
That amount of RAM can easily be eaten up depending on your sorting,
faceting, data.

Do you have gc logging enabled? That should describe what is happening with
the heap.
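
From inside the JVM, the standard management beans also give a quick view of 
collector activity and heap use; a minimal illustrative sketch (not from this 
thread, and the class name is made up):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Illustrative sketch: list GC-related JVM arguments plus basic collector
// and heap statistics for the running JVM.
public class GcVisibility {
    public static void main(String[] args) {
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            if (arg.toLowerCase().contains("gc")) {
                System.out.println("GC-related JVM flag: " + arg);
            }
        }
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms total");
        }
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        System.out.println("Heap in use: " + mem.getHeapMemoryUsage().getUsed() + " bytes");
    }
}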

- Mark

On Tue, Oct 6, 2015 at 4:04 PM Rallavagu <ra...@gmail.com> wrote:

> Mark - currently 5.3 is being evaluated for upgrade purposes and
> hopefully get there sooner. Meanwhile, following exception is noted from
> logs during updates
>
> ERROR org.apache.solr.update.CommitTracker  – auto commit
> error...:java.lang.IllegalStateException: this writer hit an
> OutOfMemoryError; cannot commit
>          at
>
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2807)
>          at
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2984)
>          at
>
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:559)
>          at
> org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
>          at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:440)
>          at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
>          at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
>          at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:896)
>          at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
>          at java.lang.Thread.run(Thread.java:682)
>
> Considering the fact that the machine is configured with 48G (24G for
> JVM which will be reduced in future) wondering how would it still go out
> of memory. For memory mapped index files the remaining 24G or what is
> available off of it should be available. Looking at the lsof output the
> memory mapped files were around 10G.
>
> Thanks.
-- 
- Mark
about.me/markrmiller

Re: Recovery Thread Blocked

Posted by Rallavagu <ra...@gmail.com>.
Mark - currently 5.3 is being evaluated for upgrade purposes, and hopefully 
we will get there soon. Meanwhile, the following exception is noted in the 
logs during updates:

ERROR org.apache.solr.update.CommitTracker – auto commit error...:java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
         at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2807)
         at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2984)
         at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:559)
         at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:440)
         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:896)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
         at java.lang.Thread.run(Thread.java:682)

Considering that the machine is configured with 48G of RAM (24G for the 
JVM, which will be reduced in the future), I am wondering how it could 
still run out of memory. For the memory-mapped index files, the remaining 
24G (or whatever portion of it is free) should be available, and the lsof 
output shows the memory-mapped files at around 10G.
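
As a rough back-of-the-envelope check (the numbers come from this thread 
except the per-thread stack size, which is an assumption controlled by -Xss), 
heap is only one consumer of the 48G: thread stacks, JVM overhead, and the 
page cache backing the memory-mapped index files all come out of the native 
side.

// Illustrative arithmetic only; the 1 MB stack size is assumed, not measured.
public class NativeMemoryEstimate {
    public static void main(String[] args) {
        double totalRamGb = 48;        // physical RAM reported in this thread
        double heapGb = 24;            // JVM heap reported in this thread
        double mmapGb = 10;            // memory-mapped index files seen in lsof
        int threads = 870;             // blocked threads seen in the dump
        double stackMbPerThread = 1.0; // assumed -Xss; check the real setting

        double stackGb = threads * stackMbPerThread / 1024.0;
        double headroomGb = totalRamGb - heapGb - mmapGb - stackGb;
        System.out.printf("Thread stacks: ~%.1f GB, remaining native headroom: ~%.1f GB%n",
                stackGb, headroomGb);
    }
}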

Thanks.


On 10/5/15 5:41 PM, Mark Miller wrote:
> I'd make two guess:
>
> Looks like you are using Jrocket? I don't think that is common or well
> tested at this point.
>
> There are a billion or so bug fixes from 4.6.1 to 5.3.2. Given the pace of
> SolrCloud, you are dealing with something fairly ancient and so it will be
> harder to find help with older issues most likely.
>
> - Mark

Re: Recovery Thread Blocked

Posted by Mark Miller <ma...@gmail.com>.
I'd make two guesses:

It looks like you are using JRockit? I don't think that is common or well
tested at this point.

There are a billion or so bug fixes between 4.6.1 and 5.3.2. Given the pace
of SolrCloud development, you are dealing with something fairly ancient, so
it will most likely be harder to find help with older issues.

- Mark

On Mon, Oct 5, 2015 at 12:46 PM Rallavagu <ra...@gmail.com> wrote:

> Any takers on this? Any kinda clue would help. Thanks.
>
> On 10/4/15 10:14 AM, Rallavagu wrote:
> > As there were no responses so far, I assume that this is not a very
> > common issue that folks come across. So, I went into source (4.6.1) to
> > see if I can figure out what could be the cause.
> >
> >
> > The thread that is locking is in this block of code
> >
> > synchronized (recoveryLock) {
> >        // to be air tight we must also check after lock
> >        if (cc.isShutDown()) {
> >          log.warn("Skipping recovery because Solr is shutdown");
> >          return;
> >        }
> >        log.info("Running recovery - first canceling any ongoing
> recovery");
> >        cancelRecovery();
> >
> >        while (recoveryRunning) {
> >          try {
> >            recoveryLock.wait(1000);
> >          } catch (InterruptedException e) {
> >
> >          }
> >          // check again for those that were waiting
> >          if (cc.isShutDown()) {
> >            log.warn("Skipping recovery because Solr is shutdown");
> >            return;
> >          }
> >          if (closed) return;
> >        }
> >
> > Subsequently, the thread will get into cancelRecovery method as below,
> >
> > public void cancelRecovery() {
> >      synchronized (recoveryLock) {
> >        if (recoveryStrat != null && recoveryRunning) {
> >          recoveryStrat.close();
> >          while (true) {
> >            try {
> >              recoveryStrat.join();
> >            } catch (InterruptedException e) {
> >              // not interruptible - keep waiting
> >              continue;
> >            }
> >            break;
> >          }
> >
> >          recoveryRunning = false;
> >          recoveryLock.notifyAll();
> >        }
> >      }
> >    }
> >
> > As per the stack trace "recoveryStrat.join()" is where things are
> > holding up.
> >
> > I wonder why/how cancelRecovery would take so long that around 870
> > threads end up waiting on it. Is it possible that ZK is not responding,
> > or could something else like operating system resources cause this?
> > Thanks.
> >
> >
> > On 10/2/15 4:17 PM, Rallavagu wrote:
> >> Here is the stack trace of the thread that is holding the lock.
> >>
> >>
> >> "Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
> >> native_blocked, daemon
> >>      -- Waiting for notification on:
> >> org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
> >>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>      at
> >> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>      at
> >>
> RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a
> >>
> >>
> >>      at
> >> jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native
> >> Method)
> >>      at java/lang/Object.wait(J)V(Native Method)
> >>      at java/lang/Thread.join(Thread.java:1206)
> >>      ^-- Lock released while waiting:
> >> org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
> >>      at java/lang/Thread.join(Thread.java:1259)
> >>      at
> >>
> org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)
> >>
> >>
> >>      ^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
> >>      at
> >>
> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)
> >>
> >>
> >>      ^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
> >>      at
> >>
> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>
> >>
> >>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
> >>
> >>
> >> Stack trace of one of the 870 threads that is waiting for the lock to be
> >> released.
> >>
> >> "Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked,
> >> native_blocked, daemon
> >>      -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat
> >> lock]
> >>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>      at
> >> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>      at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
> >>      at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
> >>      at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
> >>      at
> >> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
> >>      at
> >> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
> >>      at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
> >>      at
> >>
> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
> >>
> >>
> >>      at
> >>
> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>
> >>
> >>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
> >>
> >> On 10/2/15 4:12 PM, Rallavagu wrote:
> >>> Solr 4.6.1 on Tomcat 7, single shard 4 node cloud with 3 node zookeeper
> >>>
> >>> During updates, some nodes go to very high CPU and become
> >>> unavailable. The thread dump shows 870 threads blocked like the one
> >>> below, which explains the high CPU. Any clues on where to look?
> >>>
> >>> "Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked,
> >>> native_blocked, daemon
> >>>      -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat
> >>> lock]
> >>>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>>      at
> >>> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>>      at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
> >>>      at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
> >>>      at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
> >>>      at
> >>>
> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
> >>>      at
> >>> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
> >>>      at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
> >>>      at
> >>>
> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
> >>>
> >>>
> >>>
> >>>      at
> >>>
> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>>
> >>>
> >>>
> >>>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
>
-- 
- Mark
about.me/markrmiller

Re: Recovery Thread Blocked

Posted by Rallavagu <ra...@gmail.com>.
Any takers on this? Any kinda clue would help. Thanks.

On 10/4/15 10:14 AM, Rallavagu wrote:
> As there have been no responses so far, I assume this is not a very
> common issue. So, I went into the 4.6.1 source to see if I could figure
> out the cause.
>
>
> The thread that is holding the lock is in this block of code:
>
> synchronized (recoveryLock) {
>        // to be air tight we must also check after lock
>        if (cc.isShutDown()) {
>          log.warn("Skipping recovery because Solr is shutdown");
>          return;
>        }
>        log.info("Running recovery - first canceling any ongoing recovery");
>        cancelRecovery();
>
>        while (recoveryRunning) {
>          try {
>            recoveryLock.wait(1000);
>          } catch (InterruptedException e) {
>
>          }
>          // check again for those that were waiting
>          if (cc.isShutDown()) {
>            log.warn("Skipping recovery because Solr is shutdown");
>            return;
>          }
>          if (closed) return;
>        }
>
> Subsequently, the thread enters the cancelRecovery method, shown below:
>
> public void cancelRecovery() {
>      synchronized (recoveryLock) {
>        if (recoveryStrat != null && recoveryRunning) {
>          recoveryStrat.close();
>          while (true) {
>            try {
>              recoveryStrat.join();
>            } catch (InterruptedException e) {
>              // not interruptible - keep waiting
>              continue;
>            }
>            break;
>          }
>
>          recoveryRunning = false;
>          recoveryLock.notifyAll();
>        }
>      }
>    }
>
> As per the stack trace, "recoveryStrat.join()" is where things are
> holding up.
>
> I wonder why/how cancelRecovery could take so long that around 870
> threads end up waiting on the lock. Is it possible that ZK is not
> responding, or could something else, like operating system resources,
> cause this? Thanks.
>
>
> On 10/2/15 4:17 PM, Rallavagu wrote:
>> Here is the stack trace of the thread that is holding the lock.
>>
>>
>> "Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
>> native_blocked, daemon
>>      -- Waiting for notification on:
>> org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
>>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
>>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
>>      at
>> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
>>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
>>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
>>      at
>> RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a
>>
>>
>>      at
>> jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native
>> Method)
>>      at java/lang/Object.wait(J)V(Native Method)
>>      at java/lang/Thread.join(Thread.java:1206)
>>      ^-- Lock released while waiting:
>> org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
>>      at java/lang/Thread.join(Thread.java:1259)
>>      at
>> org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)
>>
>>
>>      ^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
>>      at
>> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)
>>
>>
>>      ^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
>>      at
>> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
>>
>>
>>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
>>
>>
>> Stack trace of one of the 870 threads that is waiting for the lock to be
>> released.
>>
>> "Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked,
>> native_blocked, daemon
>>      -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat
>> lock]
>>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
>>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
>>      at
>> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
>>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
>>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
>>      at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
>>      at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
>>      at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
>>      at
>> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
>>      at
>> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>>      at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
>>      at
>> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
>>
>>
>>      at
>> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
>>
>>
>>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
>>
>> On 10/2/15 4:12 PM, Rallavagu wrote:
>>> Solr 4.6.1 on Tomcat 7, single shard 4 node cloud with 3 node zookeeper
>>>
>>> During updates, some nodes go to very high CPU and become
>>> unavailable. The thread dump shows 870 threads blocked like the one
>>> below, which explains the high CPU. Any clues on where to look?
>>>
>>> "Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked,
>>> native_blocked, daemon
>>>      -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat
>>> lock]
>>>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
>>>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
>>>      at
>>> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
>>>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
>>>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
>>>      at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
>>>      at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
>>>      at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
>>>      at
>>> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
>>>      at
>>> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>>>      at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
>>>      at
>>> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
>>>
>>>
>>>
>>>      at
>>> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
>>>
>>>
>>>
>>>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)

Re: Recovery Thread Blocked

Posted by Rallavagu <ra...@gmail.com>.
As there have been no responses so far, I assume this is not a very
common issue. So, I went into the 4.6.1 source to see if I could figure
out the cause.


The thread that is holding the lock is in this block of code:

synchronized (recoveryLock) {
       // to be air tight we must also check after lock
       if (cc.isShutDown()) {
         log.warn("Skipping recovery because Solr is shutdown");
         return;
       }
       log.info("Running recovery - first canceling any ongoing recovery");
       cancelRecovery();

       while (recoveryRunning) {
         try {
           recoveryLock.wait(1000);
         } catch (InterruptedException e) {

         }
         // check again for those that were waiting
         if (cc.isShutDown()) {
           log.warn("Skipping recovery because Solr is shutdown");
           return;
         }
         if (closed) return;
       }

Subsequently, the thread enters the cancelRecovery method, shown below:

public void cancelRecovery() {
     synchronized (recoveryLock) {
       if (recoveryStrat != null && recoveryRunning) {
         recoveryStrat.close();
         while (true) {
           try {
             recoveryStrat.join();
           } catch (InterruptedException e) {
             // not interruptible - keep waiting
             continue;
           }
           break;
         }

         recoveryRunning = false;
         recoveryLock.notifyAll();
       }
     }
   }

As per the stack trace, "recoveryStrat.join()" is where things are
holding up.

I wonder why/how cancelRecovery could take so long that around 870
threads end up waiting on the lock. Is it possible that ZK is not
responding, or could something else, like operating system resources,
cause this? Thanks.
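
To make the failure mode concrete, here is a minimal, self-contained sketch
(invented class and thread names; this is not Solr code) of the pattern the
dumps show: one thread re-enters recoveryLock in cancelRecovery and then
join()s a recovery thread that never exits, so every later doRecovery()
caller blocks on monitorEnter for the same lock.

public class RecoveryLockPileUp {
    private static final Object recoveryLock = new Object();
    private static volatile Thread recoveryStrat;

    // Mirrors the shape of doRecovery(): every recovery request must first
    // take recoveryLock and then cancel any ongoing recovery.
    static void doRecovery() {
        synchronized (recoveryLock) {
            cancelRecovery();
            // a real implementation would start a new recovery here
        }
    }

    // Mirrors the shape of cancelRecovery(): re-enters recoveryLock and
    // join()s the old recovery thread, refusing to give up on interrupt.
    static void cancelRecovery() {
        synchronized (recoveryLock) {
            if (recoveryStrat != null) {
                while (true) {
                    try {
                        recoveryStrat.join();   // never returns if recovery is stuck
                        break;
                    } catch (InterruptedException e) {
                        // not interruptible - keep waiting
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        // Simulate a recovery that never finishes (e.g. stuck on a peer or ZK).
        recoveryStrat = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
                }
            }
        });
        recoveryStrat.setDaemon(true);
        recoveryStrat.start();

        // Every recovery request now piles up behind the first caller, which
        // sits in join() while still holding recoveryLock.
        for (int i = 0; i < 10; i++) {
            new Thread(new Runnable() {
                public void run() { doRecovery(); }
            }, "recovery-request-" + i).start();
        }
    }
}

A thread dump of this little program has the same shape as the dumps above:
one thread parked in Thread.join() while holding the monitor, and everything
else "Blocked trying to get lock" on that same monitor.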


On 10/2/15 4:17 PM, Rallavagu wrote:
> Here is the stack trace of the thread that is holding the lock.
>
>
> "Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
> native_blocked, daemon
>      -- Waiting for notification on:
> org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
>      at
> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
>      at
> RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a
>
>      at
> jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)
>      at java/lang/Object.wait(J)V(Native Method)
>      at java/lang/Thread.join(Thread.java:1206)
>      ^-- Lock released while waiting:
> org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
>      at java/lang/Thread.join(Thread.java:1259)
>      at
> org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)
>
>      ^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
>      at
> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)
>
>      ^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
>      at
> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
>
>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
>
>
> Stack trace of one of the 870 threads that is waiting for the lock to be
> released.
>
> "Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked,
> native_blocked, daemon
>      -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
>      at
> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
>      at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
>      at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
>      at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
>      at
> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
>      at
> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>      at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
>      at
> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
>
>      at
> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
>
>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
>
> On 10/2/15 4:12 PM, Rallavagu wrote:
>> Solr 4.6.1 on Tomcat 7, single shard 4 node cloud with 3 node zookeeper
>>
>> During updates, some nodes go to very high CPU and become
>> unavailable. The thread dump shows 870 threads blocked like the one
>> below, which explains the high CPU. Any clues on where to look?
>>
>> "Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked,
>> native_blocked, daemon
>>      -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat
>> lock]
>>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
>>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
>>      at
>> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
>>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
>>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
>>      at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
>>      at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
>>      at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
>>      at
>> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
>>      at
>> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>>      at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
>>      at
>> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
>>
>>
>>      at
>> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
>>
>>
>>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)

Re: Recovery Thread Blocked

Posted by Rallavagu <ra...@gmail.com>.
Here is the stack trace of the thread that is holding the lock.


"Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting, 
native_blocked, daemon
     -- Waiting for notification on: 
org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
     at 
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
     at 
RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a
     at 
jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)
     at java/lang/Object.wait(J)V(Native Method)
     at java/lang/Thread.join(Thread.java:1206)
     ^-- Lock released while waiting: 
org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
     at java/lang/Thread.join(Thread.java:1259)
     at 
org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)
     ^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
     at 
org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)
     ^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
     at 
org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)


Stack trace of one of the 870 threads that is waiting for the lock to be 
released.

"Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked, 
native_blocked, daemon
     -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
     at 
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
     at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
     at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
     at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
     at 
jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
     at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
     at 
org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
     at 
org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
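
For completeness, the same picture can be cross-checked programmatically with
the standard ThreadMXBean API. The snippet below is only a generic sketch
(assumed class name; it is not anything that ships with Solr): it counts how
many threads are BLOCKED on each monitor and reports the owning thread, which
should single out the one stuck in cancelRecovery().

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class BlockedMonitorReport {
    public static void main(String[] args) {
        // As written this reports on the JVM it runs in; to inspect Solr it
        // has to run inside that JVM, or be adapted to use a remote
        // ThreadMXBean obtained over JMX.
        report();
    }

    // Counts how many threads are BLOCKED on each monitor and prints which
    // thread currently owns that monitor.
    static void report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = mx.dumpAllThreads(false, false);

        Map<String, Integer> blockedPerLock = new HashMap<String, Integer>();
        Map<String, String> ownerPerLock = new HashMap<String, String>();

        for (ThreadInfo info : infos) {
            if (info == null) {
                continue;
            }
            if (info.getThreadState() == Thread.State.BLOCKED
                    && info.getLockInfo() != null) {
                String lock = info.getLockInfo().toString();
                Integer count = blockedPerLock.get(lock);
                blockedPerLock.put(lock, count == null ? 1 : count + 1);
                if (!ownerPerLock.containsKey(lock)) {
                    ownerPerLock.put(lock, info.getLockOwnerName());
                }
            }
        }

        for (Map.Entry<String, Integer> e : blockedPerLock.entrySet()) {
            System.out.println(e.getValue() + " threads blocked on " + e.getKey()
                    + ", held by " + ownerPerLock.get(e.getKey()));
        }
    }
}

In this case it should report a single monitor with roughly 870 blocked
threads and name the thread that is holding it.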

On 10/2/15 4:12 PM, Rallavagu wrote:
> Solr 4.6.1 on Tomcat 7, single shard 4 node cloud with 3 node zookeeper
>
> During updates, some nodes go to very high CPU and become
> unavailable. The thread dump shows 870 threads blocked like the one
> below, which explains the high CPU. Any clues on where to look?
>
> "Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked,
> native_blocked, daemon
>      -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
>      at
> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
>      at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
>      at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
>      at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
>      at
> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
>      at
> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>      at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
>      at
> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
>
>      at
> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
>
>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)