Posted to dev@nifi.apache.org by Russell Bateman <ru...@windofkeltia.com> on 2017/01/23 14:54:13 UTC

Cadmium rods?

Can we get cadmium rods?

Our down-streamers are complaining, and I have seen it myself in 
testing: NiFi gets into molasses where the only solution is to bounce 
it. Here are some comments from my users that I hope are helpful.

    "I'm getting really burned out over the issue of when NiFi
    processors get stuck, you can't get them to stop and the only
    solution is |# systemctl restart ps-nifi|. I actually keep a window
    open in tmux where I run this command all the time so I can just go
    to that window and press up enter to restart it again."

    "I have a /DistributeLoad/ processor that was sitting there doing
    nothing at all even though it said it was running. I tried
    refreshing for a while, and after several minutes I finally tried
    stopping the processor to see if stopping and starting it again
    would help.

    "So I told it to stop, then suddenly NiFi refreshed (even though it
    had been refusing to refresh for several minutes. Seems like it does
    whatever it wants, when it feels like it). Then it turned out that
    that processor actually HAD been running, I just couldn't see it.
    Now I want to start it again, but I can't, because it has a couple
    of stuck threads. So, I resort to |# systemctl restart ps-nifi|. I
    know the purpose of this UI is to give us visibility into the ETL
    process, but if it only gives us visibility when it feels like it,
    and then it only stops a process if it feels like it, it's really
    annoying."

(Of course, some of this is "point of view" and a lack of understanding 
what's really going on.)

What we do is ingest millions of medical documents including plain-text 
transcripts, HL7 pipe-delimited messages, X12 messages and CDAs (CCDs 
and CCDAs). These are analyzed for all sorts of important data and 
transformed into an intermediate format before being committed to a 
search engine and database for retrieval.

Over the last year or so we've written many dozens of custom 
processors, most of them very small, and we use many of those that 
come with NiFi to perform this ETL. We are very happy with the 
visibility NiFi gives us into what used to be a pretty opaque and 
hard-to-understand ETL component. Our custom processors range from 
very specific ones doing document analysis with regular expressions, 
to more general ones that parse HL7, XML, X12, etc., to ones invoking 
Tika and cTAKES. This all works very well in theory, but as you can 
see, there's considerable trouble and we're having a difficult time 
tuning, applying careful back-pressure, etc.

What we think we need, and we're eager for opinions here, is for NiFi to 
dedicate a thread to the UI such that bouncing NiFi is no longer the 
only option. We want to reach it and shut things down without the UI 
being held hostage to threads burdened or hung with tasks that are far 
from getting back to it. I imagine being able to right-click a process 
group and stop it like shoving cadmium rods into a radioactive pile to 
scram NiFi, examine what's going on, and find and tune the parts of our 
flow that we had not before understood were problematic. (Of course, 
what I've just said probably betrays a lack of understanding on my part 
too.)

Also, in my observation, when the quantity of files and subdirectories 
under /content_repository/ gets too big, it seems the only thing I can 
do is delete them all before starting NiFi back up.

I've been running the Java Flight Recorder, attempting to spy on our 
NiFi flows remotely using Java Mission Control. This isn't easy either, 
because of how JFR works, and my spyglass goes dark just as our users 
lose UI responsiveness.
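
For reference, this is roughly how we've enabled it, a sketch of our 
conf/bootstrap.conf entries; the Oracle JDK 8 flags are the standard 
ones, but the java.arg slot numbers and the recording file name are 
just whatever we happened to pick:

    # Unlock JFR on the NiFi JVM (Oracle JDK 8 requires the commercial-features flag)
    java.arg.20=-XX:+UnlockCommercialFeatures
    java.arg.21=-XX:+FlightRecorder
    # Optionally start a bounded recording at boot so there is data even if JMC
    # can no longer attach by the time the UI stops responding
    java.arg.22=-XX:StartFlightRecording=duration=30m,filename=./logs/nifi.jfr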

Thoughts?

Russ


Re: Cadmium rods?

Posted by Russell Bateman <ru...@windofkeltia.com>.
Joe,

Our production ("implementation") guys are running on 0.7.1 which is 
distributed with our product. Our next product release will probably see 
them running on 1.1.1 or later. I typically develop under the latest 
NiFi release unless I'm specifically paying attention to an issue.

I was hoping that JFR would help; I'll instead be going the thread-dump 
route the next time I get the chance. JFR does, however, generate a 
wealth of interesting data on what's generally happening, just not on 
what went catastrophically wrong. We were hoping that we'd find 
something easy, like learning that provenance was too expensive to keep 
running.

Yes, clearing out the repository is a sign of some impatience and 
frustration, I admit. There's no progress bar: one wonders whether 
goodness is happening or it's just hung.

I will be tackling this using your suggestions/elucidations over the 
next while. Profuse thanks for these!

Russ


Re: Cadmium rods?

Posted by Joe Witt <jo...@gmail.com>.
Russ,

It will be very important to know which version of NiFi you're
referring to in threads like this.  But let me restate the core
scenarios you're describing as I understand them.

Problem Statement #1

We at times have processors which appear to be stuck: they are no
longer processing data or performing their function, and when we
attempt to stop them they never appear to stop.  Since active threads
are still shown, NiFi also does not allow us to restart those
processors.  To overcome this we've become accustomed to simply
restarting NiFi whenever it happens.

Discussion for statement #1:

It is important to point out that a stuck processor almost always
indicates a bug in that processor: it has gotten itself into a
situation where it acquired a thread from the controller and, for one
reason or another, will never relinquish it, or won't relinquish it
for so long that to the user it feels effectively stuck.  There are
two primary actors to consider here, the developer and the user.  The
bottom line is that the conditions which make this possible need to be
found and fixed; the things we can do to help the user are simply
workarounds to lessen the impact at the time, but ultimately a stuck
thread is a stuck thread.
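
To make that concrete, here is a minimal sketch, not anyone's actual
processor; the class name, host and port are made up, and a real
processor would read them from properties.  The usual difference
between a processor that can wedge its controller thread and one that
cannot is simply whether its blocking calls are bounded:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.exception.ProcessException;

    // Illustrative only: every blocking socket call in onTrigger() is bounded,
    // so the controller always gets its thread back within a known interval.
    public class BoundedSocketProcessor extends AbstractProcessor {

        private static final int CONNECT_TIMEOUT_MS = 5_000;
        private static final int READ_TIMEOUT_MS = 10_000;

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session)
                throws ProcessException {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress("example.host", 9999), CONNECT_TIMEOUT_MS);
                socket.setSoTimeout(READ_TIMEOUT_MS);
                final InputStream in = socket.getInputStream();
                // ... read data from 'in' and create/transfer flow files via 'session' ...
            } catch (SocketTimeoutException e) {
                // The remote side is slow or hung: give the thread back and try later.
                context.yield();
            } catch (IOException e) {
                throw new ProcessException(e);
            }
        }
    }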

So thinking for the developer we can do the following:

1) Help the developer detect such cases during initial development by
improving the mock testing framework to better support tests of
lifecycle cases.  Most of the time these scenarios are brought about
by improper handling of sockets, and tests don't often exercise those
cases, but in any event we can improve our mock framework to better
test lifecycle handling (a test sketch follows after point 2 below).

2) Help the developer gather diagnostics/data about the condition in
which they are stuck.  Today this is done via obtaining thread dumps.
By capturing a thread dump, waiting a bit, then capturing another, it
is often fairly obvious which thread represents the stuck one.  In
cases of live lock this can be trickier but still generally clear.  We
could explore some idea whereby, if the framework detects a
component/processor thread not doing anything productive for a period
of time, it automatically obtains a series of thread dumps and
captures them as diagnostics/package data for that component, which
will aid the developer.  This is non-trivial to do but certainly a
reasonable step to take.
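
Regarding point 1: something along these lines is already expressible
with the existing mock framework today.  This is only a sketch, and
MyStuckProneProcessor and the "Connection Timeout" property are
stand-ins for one of your own components:

    import org.apache.nifi.util.TestRunner;
    import org.apache.nifi.util.TestRunners;
    import org.junit.Test;

    // Sketch of a lifecycle-focused unit test using the nifi-mock framework.
    // MyStuckProneProcessor and the property name are hypothetical stand-ins.
    public class MyStuckProneProcessorLifecycleTest {

        @Test
        public void stopReleasesResourcesPromptly() {
            final TestRunner runner = TestRunners.newTestRunner(MyStuckProneProcessor.class);
            runner.setProperty("Connection Timeout", "5 secs");

            // run(iterations, stopOnFinish, initialize) drives the full lifecycle:
            // @OnScheduled, onTrigger and then the @OnUnscheduled/@OnStopped methods,
            // so a processor that hangs in its shutdown path hangs this test instead
            // of production.
            runner.run(1, true, true);
        }
    }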

Have you gone the thread-dump route and identified the root cause of
any of the stuck-thread conditions?  Which processors is this
happening in?  One of the processors you mentioned was DistributeLoad.
By default that processor will stop distributing flow files if any of
its relationships are in a back-pressure condition.  You can switch to
'next available', which means it will distribute data in a flowing
fashion whereby data goes wherever there is no back pressure.  In the
latest release you also get visual indicators of back pressure, which
greatly help the user understand what is happening.
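
On the thread-dump route: jstack against the NiFi pid (or NiFi's own
bin/nifi.sh dump <file>) works today, and the automatic capture
described above needs nothing more exotic than plain JMX.  Roughly,
and only as a sketch:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Sketch: capture two thread dumps a little while apart via plain JMX.
    // The stuck thread is usually the one whose stack has not moved between them.
    public class ThreadDumpSnapshot {

        static String capture() {
            final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            final StringBuilder dump = new StringBuilder();
            // true, true => include locked monitors and synchronizers, so deadlocks
            // and live locks are visible in the output.
            for (final ThreadInfo info : threads.dumpAllThreads(true, true)) {
                dump.append(info);
            }
            return dump.toString();
        }

        public static void main(String[] args) throws InterruptedException {
            final String first = capture();
            Thread.sleep(30_000);   // wait a bit, then capture again
            final String second = capture();
            System.out.println(first);
            System.out.println(second);
        }
    }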

So thinking for the user we can do the following:

1) Help the user identify stuck threads/components and alert them to
it.  Awareness is step one, and having an early alert to the condition
will help correlate it to a cause, which can aid root-cause
resolution.

2) Help the user 'kill/restart that component'.  What is important to
point out here is that in many cases we cannot get the thread back.
But what we could certainly do is quarantine/isolate that thread and
give the component/processor a new one to work with.  Of course the
condition/code that allows this to happen is still present and will
likely occur again, but at least this gives the user some recourse
while the developers work on a solution.
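
Purely as a sketch of the idea, not of how the framework would
actually wire it up: with a growable pool, abandoning the stuck worker
and rescheduling the component on a fresh thread looks roughly like
this in plain java.util.concurrent terms.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Sketch of "quarantine the stuck thread, give the component a new one".
    public class QuarantineSketch {

        public static void main(String[] args) throws Exception {
            // A growable pool: abandoning one worker does not starve the others.
            final ExecutorService pool = Executors.newCachedThreadPool();

            final Future<?> task = pool.submit(() -> {
                try {
                    Thread.sleep(Long.MAX_VALUE); // stands in for a hung onTrigger()
                } catch (InterruptedException ignored) {
                    // A well-behaved task would clean up here; a truly stuck one never will.
                }
            });

            try {
                task.get(5, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                task.cancel(true); // best-effort interrupt of the stuck task ...
                // ... and run the component again on a different worker thread.
                pool.submit(() -> System.out.println("component rescheduled on a fresh thread"));
            }

            pool.shutdown();
        }
    }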

+++

Problem Statement #2

When the content repository in NiFi contains a large quantity of
files, it appears the only effective mechanism to get NiFi to restart
is to blow away the contents first.

Discussion for statement #2:

This is presumably related to long startup times due to a large
content repository needing to be evaluated.  I believe this should be
far more efficient in recent releases.  Can you advise what release
you're running on?  How large a content repository, in terms of file
count, are you referring to?
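
One other thing worth checking while you're at it is whether content
archiving is bounded, since retained archive can drive the file count
up.  These are the relevant conf/nifi.properties entries; the values
shown are just examples:

    # conf/nifi.properties -- bound how much archived content is retained
    nifi.content.repository.archive.enabled=true
    nifi.content.repository.archive.max.retention.period=12 hours
    nifi.content.repository.archive.max.usage.percentage=50%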

Thanks
Joe
