Posted to users@nifi.apache.org by Pat White <pa...@verizonmedia.com> on 2019/07/25 16:56:47 UTC

Debugging info for a stuck SelectHiveQL processor

Hi Folks,

I would like to ask for suggestions on debugging SelectHiveQL processors.
We've seen a very odd failure mode twice now, where a SelectHiveQL processor
that had been running fine suddenly becomes "stuck". This is on 1.6.0, so
a bit dated compared to 1.9.2, but I'm still very puzzled by the lack of
error indications.

Symptom: the processor is running fine and continues to report 'running' on
the canvas, but the input port begins to queue up and show backlogs. Stopping
the processor in the canvas reports success and shows 'stopped', but trying to
start it again gets the popup "No eligible components are selected. Please
select the components to be stopped.". Making sure the processor is clearly
selected gives the same error. The only way to get it unstuck is to restart
the primary; this appears to kill the affected threads and allows the
processor to begin running again, and at that point it's ok again.

The issue appears directly related to the processor itself, as opposed to,
say, the ConnectionPool. On that, I tried restarting the ConnectionPool being
used; the stop attempt hangs on the affected processor, to the point that the
stop fails. Another oddity: I tried stopping objects upstream of the affected
processor, and they report "cannot be disabled because it is referenced by 1
components that are currently running", even though the canvas clearly
shows that processor as stopped.

What's really strange is the lack of error indications anywhere. I see
nothing in the logs at all regarding the affected processor until the primary
restart. Then I see the start event when the processor is coming back
online: "StandardProcessScheduler Starting SelectHiveQL id=".

I'd appreciate any suggestions on additional logging or other resources that
would help debug this. Thanks!

patw

Re: Debugging info for a stuck SelectHiveQL processor

Posted by Pat White <pa...@verizonmedia.com>.
Great info Koji, thank you very much. Will do.

patw

On Thu, Jul 25, 2019 at 9:40 PM Koji Kawamura <ij...@gmail.com>
wrote:

> Hi Pat,
>
> I recommend getting a thread dump the next time you encounter the
> situation.
> A thread dump shows what each thread is doing, including the stuck
> SelectHiveQL thread.
>
> You can get a thread dump by executing:
> ${NIFI_HOME}/bin/nifi.sh dump <dump-file-name>
>
> The thread stack traces are then written to the specified file.
> Many of the entries will look like the one below:
> "Timer-Driven Process Thread-8" Id=71 WAITING  on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1b3abf12
>         at sun.misc.Unsafe.park(Native Method)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>
> Once you get the thread dump, please share it with us for further
> investigation.
>
> Thanks,
> Koji

Re: Debugging info for a stuck SelectHiveQL processor

Posted by Koji Kawamura <ij...@gmail.com>.
Hi Pat,

I recommend getting a thread dump the next time you encounter the situation.
A thread dump shows what each thread is doing, including the stuck
SelectHiveQL thread.

You can get a thread dump by executing:
${NIFI_HOME}/bin/nifi.sh dump <dump-file-name>

The thread stack traces are then written to the specified file.
Many of the entries will look like the one below:
"Timer-Driven Process Thread-8" Id=71 WAITING  on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1b3abf12
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
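
A dump from a busy instance can be very long, so it may help to filter it
before reading. The following is only a rough Python sketch, not something
shipped with NiFi. It assumes the dump entries follow the format of the sample
above (a quoted thread name, Id=, a state, then indented "at" frames). The
function names, the second sample entry, and its org.example frame are made up
for illustration.

```python
import re
from typing import Dict, List

# Header lines look like: "Timer-Driven Process Thread-8" Id=71 WAITING on ...
HEADER = re.compile(r'^"(?P<name>[^"]+)"\s+Id=(?P<id>\d+)\s+(?P<state>[A-Z_]+)')


def parse_thread_dump(text: str) -> List[Dict]:
    """Split dump text into threads with name, id, state and stack frames."""
    threads: List[Dict] = []
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = {
                "name": match.group("name"),
                "id": int(match.group("id")),
                "state": match.group("state"),
                "frames": [],
            }
            threads.append(current)
        elif current is not None and line.strip().startswith("at "):
            # Store the frame without the leading "at " marker.
            current["frames"].append(line.strip()[len("at "):])
    return threads


def threads_touching(threads: List[Dict], needle: str) -> List[Dict]:
    """Return threads with at least one stack frame containing `needle`."""
    return [t for t in threads if any(needle in f for f in t["frames"])]


# The first entry is the idle thread quoted above. The second is a made-up
# entry (hypothetical class and line number) standing in for a stuck thread.
SAMPLE = '''\
"Timer-Driven Process Thread-8" Id=71 WAITING  on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1b3abf12
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)

"Timer-Driven Process Thread-3" Id=66 RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at org.example.SelectHiveQL.onTrigger(SelectHiveQL.java:1)
'''

if __name__ == "__main__":
    for t in threads_touching(parse_thread_dump(SAMPLE), "SelectHiveQL"):
        print(t["name"], t["state"])
        for frame in t["frames"]:
            print("    at", frame)
```

To try this on a real dump, read the file written by nifi.sh and pass its text
to parse_thread_dump in place of SAMPLE. A thread that shows the same frames
in two dumps taken a minute or so apart is a likely candidate for the stuck
one.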

Once you get the thread dump, please share it with us for further investigation.

Thanks,
Koji
