Posted to users@nifi.apache.org by Mikhail Sosonkin <mi...@synack.com> on 2017/02/17 02:43:38 UTC

Deadlocks after upgrade from 0.6.1 to 1.1.1

Hello,

Recently, we upgraded from 0.6.1 to 1.1.1 and at first everything was
working well. However, a few hours later none of the processors were
showing any activity. I then tried restarting nifi, which caused some
flowfiles to get corrupted (as evidenced by exceptions thrown in
nifi-app.log), yet the processors still produced no activity. Next, I
stop the service and delete all state (content_repository,
database_repository, flowfile_repository, provenance_repository, work).
Then the processors start working for a few hours (maybe a day) until the
deadlock occurs again.
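
For concreteness, the reset amounts to something like the following, run from
the NiFi install directory (default repository locations assumed; note that
this throws away all queued flowfiles and all provenance history):

    bin/nifi.sh stop
    # remove the repository state left over from the previous run
    rm -rf content_repository database_repository flowfile_repository \
           provenance_repository work
    bin/nifi.sh start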

So the cycle continues: I have to periodically reset the service and
delete the state to get things moving. Obviously, that's not great. I'll
note that the flow.xml file has been changed by the new version of nifi as
I added/removed processors, but 95% of the flow configuration is the same
as before the upgrade. So I'm wondering if there is a configuration
setting that causes these deadlocks.

What I've been able to observe is that the deadlock is "gradual": my flow
usually takes about 4-5 threads to execute, but when it happens the worker
threads max out at the limit and I'm not even able to stop any processors
or list queues. I also have not seen this behavior in a fresh install of
Nifi where the flow.xml starts out empty.

Can you give me some advice on what to do about this? Would the problem be
resolved if I manually rebuilt the flow with the new version of Nifi (not
looking forward to that)?

Much appreciated.

Mike.


Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Joe Witt <jo...@gmail.com>.
k - very happy to mention there is a PR out, under review, which will
offer an alternative provenance implementation.  Sustained high-rate
testing has shown an out-of-the-box 2.5x improvement with immediate
indexing/results, so all kinds of fun there.

On Fri, Feb 17, 2017 at 12:13 AM, Mikhail Sosonkin <mi...@synack.com> wrote:
> Let me look through the log. I've not seen anything too weird there before,
> but I'll check again. In the UI, I quite normally see flows getting slowed
> because provenance can't keep up. But it hasn't been too slow for us, so I
> didn't pay much attention.
>
> On Fri, Feb 17, 2017 at 12:03 AM, Joe Witt <jo...@gmail.com> wrote:
>>
>> when I said one more thing i definitely lied.
>>
>> Can you see anything in the UI indicating provenance backpressure is
>> being applied and if you look in the app log is there anything
>> interesting that isn't too sensitive to share?
>>
>> On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <jo...@gmail.com> wrote:
>> > Mike
>> >
>> > One more thing...can you please grab a couple more thread dumps for us
>> > with 5 to 10 mins between?
>> >
>> > I don't see a deadlock but do suspect either just crazy slow IO going
>> > on or a possible livelock.  The thread dump will help narrow that down
>> > a bit.
>> >
>> > Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
>> > system too please.
>> >
>> > Thanks
>> > Joe
>> >
>> > On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <jo...@gmail.com> wrote:
>> >> Mike,
>> >>
>> >> No need for more info.  Heap/GC looks beautiful.
>> >>
>> >> The thread dump however, shows some problems.  The provenance
>> >> repository is locked up.  Numerous threads are sitting here
>> >>
>> >> at
>> >> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>> >> at
>> >> org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
>> >>
>> >> This means these are processors committing their sessions and updating
>> >> provenance but they're waiting on a readlock to provenance.  This lock
>> >> cannot be obtained because a provenance maintenance thread is
>> >> attempting to purge old events and cannot.
>> >>
>> >> I recall us having addressed this so am looking to see when that was
>> >> addressed.  If provenance is not critical for you right now you can
>> >> swap out the persistent implementation with the volatile provenance
>> >> repository.  In nifi.properties change this line
>> >>
>> >>
>> >> nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
>> >>
>> >> to
>> >>
>> >>
>> >> nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
>> >>
>> >> The behavior reminds me of this issue which was fixed in 1.x
>> >> https://issues.apache.org/jira/browse/NIFI-2395
>> >>
>> >> Need to dig into this more...
>> >>
>> >> Thanks
>> >> Joe
>> >>
>> >> On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <mi...@synack.com>
>> >> wrote:
>> >>> Hi Joe,
>> >>>
>> >>> Thank you for your quick response. The system is currently in the
>> >>> deadlock
>> >>> state with 10 worker threads spinning. So, I'll gather the info you
>> >>> requested.
>> >>>
>> >>> - The available space on the partition is 223G free of 500G (same as
>> >>> was
>> >>> available for 0.6.1)
>> >>> - java.arg.3=-Xmx4096m in bootstrap.conf
>> >>> - thread dump and jstats are here
>> >>> https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
>> >>>
>> >>> Unfortunately, it's hard to predict when the decay starts and it takes
>> >>> too
>> >>> long to have to monitor the system manually. However, if you still
>> >>> need,
>> >>> after seeing the attached dumps, the thread dumps while it decays I
>> >>> can set
>> >>> up a timer script.
>> >>>
>> >>> Let me know if you need any more info.
>> >>>
>> >>> Thanks,
>> >>> Mike.
>> >>>
>> >>>
>> >>> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <jo...@gmail.com> wrote:
>> >>>>
>> >>>> Mike,
>> >>>>
>> >>>> Can you capture a series of thread dumps as the gradual decay occurs
>> >>>> and signal at what point they were generated specifically calling out
>> >>>> the "now the system is doing nothing" point.  Can you check for space
>> >>>> available on the system during these times as well.  Also, please
>> >>>> advise on the behavior of the heap/garbage collection.  Often (not
>> >>>> always) a gradual decay in performance can suggest an issue with GC
>> >>>> as
>> >>>> you know.  Can you run something like
>> >>>>
>> >>>> jstat -gcutil -h5 <pid> 1000
>> >>>>
>> >>>> And capture those rules in these chunks as well.
>> >>>>
>> >>>> This would give us a pretty good picture of the health of the system/
>> >>>> and JVM around these times.  It is probably too much for the mailing
>> >>>> list for the info so feel free to create a JIRA for this and put
>> >>>> attachments there or link to gists in github/etc.
>> >>>>
>> >>>> Pretty confident we can get to the bottom of what you're seeing
>> >>>> quickly.
>> >>>>
>> >>>> Thanks
>> >>>> Joe
>> >>>>
>> >>>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin
>> >>>> <mi...@synack.com>
>> >>>> wrote:
>> >>>> > Hello,
>> >>>> >
>> >>>> > Recently, we've upgraded from 0.6.1 to 1.1.1 and at first
>> >>>> > everything was
>> >>>> > working well. However, a few hours later none of the processors
>> >>>> > were
>> >>>> > showing
>> >>>> > any activity. Then, I tried restarting nifi which caused some
>> >>>> > flowfiles
>> >>>> > to
>> >>>> > get corrupted evidenced by exceptions thrown in the nifi-app.log,
>> >>>> > however
>> >>>> > the processors still continue to produce no activity. Next, I stop
>> >>>> > the
>> >>>> > service and delete all state (content_repository
>> >>>> > database_repository
>> >>>> > flowfile_repository provenance_repository work). Then the
>> >>>> > processors
>> >>>> > start
>> >>>> > working for a few hours (maybe a day) until the deadlock occurs
>> >>>> > again.
>> >>>> >
>> >>>> > So, this cycle continues where I have to periodically reset the
>> >>>> > service
>> >>>> > and
>> >>>> > delete the state to get things moving. Obviously, that's not great.
>> >>>> > I'll
>> >>>> > note that the flow.xml file has been changed, as I added/removed
>> >>>> > processors,
>> >>>> > by the new version of nifi but 95% of the flow configuration is the
>> >>>> > same
>> >>>> > as
>> >>>> > before the upgrade. So, I'm wondering if there is a configuration
>> >>>> > setting
>> >>>> > that causes these deadlocks.
>> >>>> >
>> >>>> > What I've been able to observe is that the deadlock is "gradual" in
>> >>>> > that
>> >>>> > my
>> >>>> > flow usually takes about 4-5 threads to execute. The deadlock
>> >>>> > causes the
>> >>>> > worker threads to max out at the limit and I'm not even able to
>> >>>> > stop any
>> >>>> > processors or list queues. I also, have not seen this behavior in a
>> >>>> > fresh
>> >>>> > install of Nifi where the flow.xml would start out empty.
>> >>>> >
>> >>>> > Can you give me some advise on what to do about this? Would the
>> >>>> > problem
>> >>>> > be
>> >>>> > resolved if I manually rebuild the flow with the new version of
>> >>>> > Nifi
>> >>>> > (not
>> >>>> > looking forward to that)?
>> >>>> >
>> >>>> > Much appreciated.
>> >>>> >
>> >>>> > Mike.
>> >>>> >
>> >>>
>> >>>
>> >>>
>
>
>

Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Mikhail Sosonkin <mi...@synack.com>.
Let me look through the log. I've not seen anything too weird there before,
but I'll check again. In the UI, I quite normally see flows getting slowed
because provenance can't keep up. But it hasn't been too slow for us, so I
didn't pay much attention.

On Fri, Feb 17, 2017 at 12:03 AM, Joe Witt <jo...@gmail.com> wrote:

> when I said one more thing i definitely lied.
>
> Can you see anything in the UI indicating provenance backpressure is
> being applied and if you look in the app log is there anything
> interesting that isn't too sensitive to share?
>
> On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <jo...@gmail.com> wrote:
> > Mike
> >
> > One more thing...can you please grab a couple more thread dumps for us
> > with 5 to 10 mins between?
> >
> > I don't see a deadlock but do suspect either just crazy slow IO going
> > on or a possible livelock.  The thread dump will help narrow that down
> > a bit.
> >
> > Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
> > system too please.
> >
> > Thanks
> > Joe
> >
> > On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <jo...@gmail.com> wrote:
> >> Mike,
> >>
> >> No need for more info.  Heap/GC looks beautiful.
> >>
> >> The thread dump however, shows some problems.  The provenance
> >> repository is locked up.  Numerous threads are sitting here
> >>
> >> at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(
> ReentrantReadWriteLock.java:727)
> >> at org.apache.nifi.provenance.PersistentProvenanceRepository
> .persistRecord(PersistentProvenanceRepository.java:757)
> >>
> >> This means these are processors committing their sessions and updating
> >> provenance but they're waiting on a readlock to provenance.  This lock
> >> cannot be obtained because a provenance maintenance thread is
> >> attempting to purge old events and cannot.
> >>
> >> I recall us having addressed this so am looking to see when that was
> >> addressed.  If provenance is not critical for you right now you can
> >> swap out the persistent implementation with the volatile provenance
> >> repository.  In nifi.properties change this line
> >>
> >> nifi.provenance.repository.implementation=org.apache.nifi.provenance.
> PersistentProvenanceRepository
> >>
> >> to
> >>
> >> nifi.provenance.repository.implementation=org.apache.nifi.provenance.
> VolatileProvenanceRepository
> >>
> >> The behavior reminds me of this issue which was fixed in 1.x
> >> https://issues.apache.org/jira/browse/NIFI-2395
> >>
> >> Need to dig into this more...
> >>
> >> Thanks
> >> Joe
> >>
> >> On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <mi...@synack.com>
> wrote:
> >>> Hi Joe,
> >>>
> >>> Thank you for your quick response. The system is currently in the
> deadlock
> >>> state with 10 worker threads spinning. So, I'll gather the info you
> >>> requested.
> >>>
> >>> - The available space on the partition is 223G free of 500G (same as
> was
> >>> available for 0.6.1)
> >>> - java.arg.3=-Xmx4096m in bootstrap.conf
> >>> - thread dump and jstats are here
> >>> https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
> >>>
> >>> Unfortunately, it's hard to predict when the decay starts and it takes
> too
> >>> long to have to monitor the system manually. However, if you still
> need,
> >>> after seeing the attached dumps, the thread dumps while it decays I
> can set
> >>> up a timer script.
> >>>
> >>> Let me know if you need any more info.
> >>>
> >>> Thanks,
> >>> Mike.
> >>>
> >>>
> >>> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <jo...@gmail.com> wrote:
> >>>>
> >>>> Mike,
> >>>>
> >>>> Can you capture a series of thread dumps as the gradual decay occurs
> >>>> and signal at what point they were generated specifically calling out
> >>>> the "now the system is doing nothing" point.  Can you check for space
> >>>> available on the system during these times as well.  Also, please
> >>>> advise on the behavior of the heap/garbage collection.  Often (not
> >>>> always) a gradual decay in performance can suggest an issue with GC as
> >>>> you know.  Can you run something like
> >>>>
> >>>> jstat -gcutil -h5 <pid> 1000
> >>>>
> >>>> And capture those rules in these chunks as well.
> >>>>
> >>>> This would give us a pretty good picture of the health of the system/
> >>>> and JVM around these times.  It is probably too much for the mailing
> >>>> list for the info so feel free to create a JIRA for this and put
> >>>> attachments there or link to gists in github/etc.
> >>>>
> >>>> Pretty confident we can get to the bottom of what you're seeing
> quickly.
> >>>>
> >>>> Thanks
> >>>> Joe
> >>>>
> >>>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <mikhail@synack.com
> >
> >>>> wrote:
> >>>> > Hello,
> >>>> >
> >>>> > Recently, we've upgraded from 0.6.1 to 1.1.1 and at first
> everything was
> >>>> > working well. However, a few hours later none of the processors were
> >>>> > showing
> >>>> > any activity. Then, I tried restarting nifi which caused some
> flowfiles
> >>>> > to
> >>>> > get corrupted evidenced by exceptions thrown in the nifi-app.log,
> >>>> > however
> >>>> > the processors still continue to produce no activity. Next, I stop
> the
> >>>> > service and delete all state (content_repository database_repository
> >>>> > flowfile_repository provenance_repository work). Then the processors
> >>>> > start
> >>>> > working for a few hours (maybe a day) until the deadlock occurs
> again.
> >>>> >
> >>>> > So, this cycle continues where I have to periodically reset the
> service
> >>>> > and
> >>>> > delete the state to get things moving. Obviously, that's not great.
> I'll
> >>>> > note that the flow.xml file has been changed, as I added/removed
> >>>> > processors,
> >>>> > by the new version of nifi but 95% of the flow configuration is the
> same
> >>>> > as
> >>>> > before the upgrade. So, I'm wondering if there is a configuration
> >>>> > setting
> >>>> > that causes these deadlocks.
> >>>> >
> >>>> > What I've been able to observe is that the deadlock is "gradual" in
> that
> >>>> > my
> >>>> > flow usually takes about 4-5 threads to execute. The deadlock
> causes the
> >>>> > worker threads to max out at the limit and I'm not even able to
> stop any
> >>>> > processors or list queues. I also, have not seen this behavior in a
> >>>> > fresh
> >>>> > install of Nifi where the flow.xml would start out empty.
> >>>> >
> >>>> > Can you give me some advise on what to do about this? Would the
> problem
> >>>> > be
> >>>> > resolved if I manually rebuild the flow with the new version of Nifi
> >>>> > (not
> >>>> > looking forward to that)?
> >>>> >
> >>>> > Much appreciated.
> >>>> >
> >>>> > Mike.
> >>>> >
> >>>
> >>>
> >>>
>


Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Joe Witt <jo...@gmail.com>.
When I said one more thing, I definitely lied.

Can you see anything in the UI indicating provenance backpressure is being
applied? And if you look in the app log, is there anything interesting
that isn't too sensitive to share?
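
A rough way to skim the app log for provenance-related trouble, assuming the
default logs/ directory (the exact message text varies, so this is only a
keyword filter):

    grep -E 'WARN|ERROR' logs/nifi-app.log | grep -i provenance | tail -n 50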

On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <jo...@gmail.com> wrote:
> Mike
>
> One more thing...can you please grab a couple more thread dumps for us
> with 5 to 10 mins between?
>
> I don't see a deadlock but do suspect either just crazy slow IO going
> on or a possible livelock.  The thread dump will help narrow that down
> a bit.
>
> Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
> system too please.
>
> Thanks
> Joe
>
> On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <jo...@gmail.com> wrote:
>> Mike,
>>
>> No need for more info.  Heap/GC looks beautiful.
>>
>> The thread dump however, shows some problems.  The provenance
>> repository is locked up.  Numerous threads are sitting here
>>
>> at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>> at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
>>
>> This means these are processors committing their sessions and updating
>> provenance but they're waiting on a readlock to provenance.  This lock
>> cannot be obtained because a provenance maintenance thread is
>> attempting to purge old events and cannot.
>>
>> I recall us having addressed this so am looking to see when that was
>> addressed.  If provenance is not critical for you right now you can
>> swap out the persistent implementation with the volatile provenance
>> repository.  In nifi.properties change this line
>>
>> nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
>>
>> to
>>
>> nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
>>
>> The behavior reminds me of this issue which was fixed in 1.x
>> https://issues.apache.org/jira/browse/NIFI-2395
>>
>> Need to dig into this more...
>>
>> Thanks
>> Joe
>>
>> On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <mi...@synack.com> wrote:
>>> Hi Joe,
>>>
>>> Thank you for your quick response. The system is currently in the deadlock
>>> state with 10 worker threads spinning. So, I'll gather the info you
>>> requested.
>>>
>>> - The available space on the partition is 223G free of 500G (same as was
>>> available for 0.6.1)
>>> - java.arg.3=-Xmx4096m in bootstrap.conf
>>> - thread dump and jstats are here
>>> https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
>>>
>>> Unfortunately, it's hard to predict when the decay starts and it takes too
>>> long to have to monitor the system manually. However, if you still need,
>>> after seeing the attached dumps, the thread dumps while it decays I can set
>>> up a timer script.
>>>
>>> Let me know if you need any more info.
>>>
>>> Thanks,
>>> Mike.
>>>
>>>
>>> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <jo...@gmail.com> wrote:
>>>>
>>>> Mike,
>>>>
>>>> Can you capture a series of thread dumps as the gradual decay occurs
>>>> and signal at what point they were generated specifically calling out
>>>> the "now the system is doing nothing" point.  Can you check for space
>>>> available on the system during these times as well.  Also, please
>>>> advise on the behavior of the heap/garbage collection.  Often (not
>>>> always) a gradual decay in performance can suggest an issue with GC as
>>>> you know.  Can you run something like
>>>>
>>>> jstat -gcutil -h5 <pid> 1000
>>>>
>>>> And capture those rules in these chunks as well.
>>>>
>>>> This would give us a pretty good picture of the health of the system/
>>>> and JVM around these times.  It is probably too much for the mailing
>>>> list for the info so feel free to create a JIRA for this and put
>>>> attachments there or link to gists in github/etc.
>>>>
>>>> Pretty confident we can get to the bottom of what you're seeing quickly.
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <mi...@synack.com>
>>>> wrote:
>>>> > Hello,
>>>> >
>>>> > Recently, we've upgraded from 0.6.1 to 1.1.1 and at first everything was
>>>> > working well. However, a few hours later none of the processors were
>>>> > showing
>>>> > any activity. Then, I tried restarting nifi which caused some flowfiles
>>>> > to
>>>> > get corrupted evidenced by exceptions thrown in the nifi-app.log,
>>>> > however
>>>> > the processors still continue to produce no activity. Next, I stop the
>>>> > service and delete all state (content_repository database_repository
>>>> > flowfile_repository provenance_repository work). Then the processors
>>>> > start
>>>> > working for a few hours (maybe a day) until the deadlock occurs again.
>>>> >
>>>> > So, this cycle continues where I have to periodically reset the service
>>>> > and
>>>> > delete the state to get things moving. Obviously, that's not great. I'll
>>>> > note that the flow.xml file has been changed, as I added/removed
>>>> > processors,
>>>> > by the new version of nifi but 95% of the flow configuration is the same
>>>> > as
>>>> > before the upgrade. So, I'm wondering if there is a configuration
>>>> > setting
>>>> > that causes these deadlocks.
>>>> >
>>>> > What I've been able to observe is that the deadlock is "gradual" in that
>>>> > my
>>>> > flow usually takes about 4-5 threads to execute. The deadlock causes the
>>>> > worker threads to max out at the limit and I'm not even able to stop any
>>>> > processors or list queues. I also, have not seen this behavior in a
>>>> > fresh
>>>> > install of Nifi where the flow.xml would start out empty.
>>>> >
>>>> > Can you give me some advise on what to do about this? Would the problem
>>>> > be
>>>> > resolved if I manually rebuild the flow with the new version of Nifi
>>>> > (not
>>>> > looking forward to that)?
>>>> >
>>>> > Much appreciated.
>>>> >
>>>> > Mike.
>>>> >
>>>
>>>
>>>

Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Andy LoPresto <al...@apache.org>.
Russ,

The ticket you reference [1] is still open, and I did not see any changes in 0.7.2 or 1.1.2 that would indicate any fix was included. You can create a PR with your code in it (or ask someone to do it if you’re not comfortable with GitHub).

https://issues.apache.org/jira/browse/NIFI-3364
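
If it helps, the mechanics are roughly as follows, assuming you already have a
GitHub fork of apache/nifi (the branch name and commit message below are just
placeholders):

    git clone https://github.com/<your-account>/nifi.git && cd nifi
    git checkout -b NIFI-3364
    # apply the bootstrap.conf argument-ordering fix, then:
    git commit -am "NIFI-3364 Fix bootstrap.conf numeric-argument ordering"
    git push origin NIFI-3364
    # finally, open a pull request from that branch against apache/nifi on GitHub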

Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Feb 17, 2017, at 8:31 AM, Russell Bateman <ru...@windofkeltia.com> wrote:
> 
> Mikhail,
> 
> I have a short article with step-by-step information and comments on how I profile NiFi. You'll want the latest NiFi release, however, because the Java Flight Recorder JVM arguments are very order-dependent. (I'm assuming that NiFi 1.1.2 and 0.7.2 have the fix for conf/bootstrap.conf numeric-argument order.) I've been using this for a couple of months and finally got around to writing it up from my personal notes in a more usable form:
> 
> http://www.javahotchocolate.com/notes/jfr.html
> 
> I hope this is helpful.
> 
> Russ
> 
> On 02/16/2017 10:18 PM, Mikhail Sosonkin wrote:
>> Been a while since I've used a profiler, but I'll give it a shot when I get to a place with faster internet link :)
>> 
>> On Fri, Feb 17, 2017 at 12:08 AM, Tony Kurc <trkurc@gmail.com> wrote:
>> Mike, also if what Joe asked with the backpressure is "not being applied", if you're good with a profiler, I think joe and I both gravitated to 0x00000006c533b770 being locked in at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757). It would be interesting to see if that section is taking longer over time.
>> 
>> On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <joe.witt@gmail.com> wrote:
>> Mike
>> 
>> One more thing...can you please grab a couple more thread dumps for us
>> with 5 to 10 mins between?
>> 
>> I don't see a deadlock but do suspect either just crazy slow IO going
>> on or a possible livelock.  The thread dump will help narrow that down
>> a bit.
>> 
>> Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
>> system too please.
>> 
>> Thanks
>> Joe
>> 
>> On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <joe.witt@gmail.com> wrote:
>> > Mike,
>> >
>> > No need for more info.  Heap/GC looks beautiful.
>> >
>> > The thread dump however, shows some problems.  The provenance
>> > repository is locked up.  Numerous threads are sitting here
>> >
>> > at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>> > at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
>> >
>> > This means these are processors committing their sessions and updating
>> > provenance but they're waiting on a readlock to provenance.  This lock
>> > cannot be obtained because a provenance maintenance thread is
>> > attempting to purge old events and cannot.
>> >
>> > I recall us having addressed this so am looking to see when that was
>> > addressed.  If provenance is not critical for you right now you can
>> > swap out the persistent implementation with the volatile provenance
>> > repository.  In nifi.properties change this line
>> >
>> > nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
>> >
>> > to
>> >
>> > nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
>> >
>> > The behavior reminds me of this issue which was fixed in 1.x
>> > https://issues.apache.org/jira/browse/NIFI-2395
>> >
>> > Need to dig into this more...
>> >
>> > Thanks
>> > Joe
>> >
>> > On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <mikhail@synack.com> wrote:
>> >> Hi Joe,
>> >>
>> >> Thank you for your quick response. The system is currently in the deadlock
>> >> state with 10 worker threads spinning. So, I'll gather the info you
>> >> requested.
>> >>
>> >> - The available space on the partition is 223G free of 500G (same as was
>> >> available for 0.6.1)
>> >> - java.arg.3=-Xmx4096m in bootstrap.conf
>> >> - thread dump and jstats are here
>> >> https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
>> >>
>> >> Unfortunately, it's hard to predict when the decay starts and it takes too
>> >> long to have to monitor the system manually. However, if you still need,
>> >> after seeing the attached dumps, the thread dumps while it decays I can set
>> >> up a timer script.
>> >>
>> >> Let me know if you need any more info.
>> >>
>> >> Thanks,
>> >> Mike.
>> >>
>> >>
>> >> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <joe.witt@gmail.com> wrote:
>> >>>
>> >>> Mike,
>> >>>
>> >>> Can you capture a series of thread dumps as the gradual decay occurs
>> >>> and signal at what point they were generated specifically calling out
>> >>> the "now the system is doing nothing" point.  Can you check for space
>> >>> available on the system during these times as well.  Also, please
>> >>> advise on the behavior of the heap/garbage collection.  Often (not
>> >>> always) a gradual decay in performance can suggest an issue with GC as
>> >>> you know.  Can you run something like
>> >>>
>> >>> jstat -gcutil -h5 <pid> 1000
>> >>>
>> >>> And capture those rules in these chunks as well.
>> >>>
>> >>> This would give us a pretty good picture of the health of the system/
>> >>> and JVM around these times.  It is probably too much for the mailing
>> >>> list for the info so feel free to create a JIRA for this and put
>> >>> attachments there or link to gists in github/etc.
>> >>>
>> >>> Pretty confident we can get to the bottom of what you're seeing quickly.
>> >>>
>> >>> Thanks
>> >>> Joe
>> >>>
>> >>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <mikhail@synack.com>
>> >>> wrote:
>> >>> > Hello,
>> >>> >
>> >>> > Recently, we've upgraded from 0.6.1 to 1.1.1 and at first everything was
>> >>> > working well. However, a few hours later none of the processors were
>> >>> > showing
>> >>> > any activity. Then, I tried restarting nifi which caused some flowfiles
>> >>> > to
>> >>> > get corrupted evidenced by exceptions thrown in the nifi-app.log,
>> >>> > however
>> >>> > the processors still continue to produce no activity. Next, I stop the
>> >>> > service and delete all state (content_repository database_repository
>> >>> > flowfile_repository provenance_repository work). Then the processors
>> >>> > start
>> >>> > working for a few hours (maybe a day) until the deadlock occurs again.
>> >>> >
>> >>> > So, this cycle continues where I have to periodically reset the service
>> >>> > and
>> >>> > delete the state to get things moving. Obviously, that's not great. I'll
>> >>> > note that the flow.xml file has been changed, as I added/removed
>> >>> > processors,
>> >>> > by the new version of nifi but 95% of the flow configuration is the same
>> >>> > as
>> >>> > before the upgrade. So, I'm wondering if there is a configuration
>> >>> > setting
>> >>> > that causes these deadlocks.
>> >>> >
>> >>> > What I've been able to observe is that the deadlock is "gradual" in that
>> >>> > my
>> >>> > flow usually takes about 4-5 threads to execute. The deadlock causes the
>> >>> > worker threads to max out at the limit and I'm not even able to stop any
>> >>> > processors or list queues. I also, have not seen this behavior in a
>> >>> > fresh
>> >>> > install of Nifi where the flow.xml would start out empty.
>> >>> >
>> >>> > Can you give me some advise on what to do about this? Would the problem
>> >>> > be
>> >>> > resolved if I manually rebuild the flow with the new version of Nifi
>> >>> > (not
>> >>> > looking forward to that)?
>> >>> >
>> >>> > Much appreciated.
>> >>> >
>> >>> > Mike.
>> >>> >
>> >>
>> >>
>> >>
>> 
>> 
>> 
> 


Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Russell Bateman <ru...@windofkeltia.com>.
Mikhail,

I have a short article with step-by-step information and comments on how 
I profile NiFi. You'll want the latest NiFi release, however, because 
the Java Flight Recorder JVM arguments are very order-dependent. (I'm 
assuming that NiFi 1.1.2 and 0.7.2 have the fix for 
conf/bootstrap.conf numeric-argument order.) I've been using this for 
a couple of months and finally got around to writing it up from my 
personal notes in a more usable form:

http://www.javahotchocolate.com/notes/jfr.html

I hope this is helpful.

Russ
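
For reference, the Flight Recorder arguments end up as numbered entries in
conf/bootstrap.conf, roughly like the lines below (the arg numbers are
arbitrary here - pick ones not already in use - and these particular flags
apply to Oracle JDK 8, where JFR is still a commercial feature):

    java.arg.20=-XX:+UnlockCommercialFeatures
    java.arg.21=-XX:+FlightRecorder
    java.arg.22=-XX:StartFlightRecording=duration=30m,filename=./nifi-jfr.jfr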

On 02/16/2017 10:18 PM, Mikhail Sosonkin wrote:
> Been a while since I've used a profiler, but I'll give it a shot when 
> I get to a place with faster internet link :)
>
> On Fri, Feb 17, 2017 at 12:08 AM, Tony Kurc <trkurc@gmail.com> wrote:
>
>     Mike, also if what Joe asked with the backpressure is "not being
>     applied", if you're good with a profiler, I think joe and I both
>     gravitated to 0x00000006c533b770 being locked in at
>     org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757).
>     It would be interesting to see if that section is taking longer
>     over time.
>
>     On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <joe.witt@gmail.com> wrote:
>
>         Mike
>
>         One more thing...can you please grab a couple more thread
>         dumps for us
>         with 5 to 10 mins between?
>
>         I don't see a deadlock but do suspect either just crazy slow
>         IO going
>         on or a possible livelock.  The thread dump will help narrow
>         that down
>         a bit.
>
>         Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
>         system too please.
>
>         Thanks
>         Joe
>
>         On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <joe.witt@gmail.com> wrote:
>         > Mike,
>         >
>         > No need for more info.  Heap/GC looks beautiful.
>         >
>         > The thread dump however, shows some problems.  The provenance
>         > repository is locked up.  Numerous threads are sitting here
>         >
>         > at
>         java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>         > at
>         org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
>         >
>         > This means these are processors committing their sessions
>         and updating
>         > provenance but they're waiting on a readlock to provenance. 
>         This lock
>         > cannot be obtained because a provenance maintenance thread is
>         > attempting to purge old events and cannot.
>         >
>         > I recall us having addressed this so am looking to see when
>         that was
>         > addressed.  If provenance is not critical for you right now
>         you can
>         > swap out the persistent implementation with the volatile
>         provenance
>         > repository.  In nifi.properties change this line
>         >
>         >
>         nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
>         >
>         > to
>         >
>         >
>         nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
>         >
>         > The behavior reminds me of this issue which was fixed in 1.x
>         > https://issues.apache.org/jira/browse/NIFI-2395
>         >
>         > Need to dig into this more...
>         >
>         > Thanks
>         > Joe
>         >
>         > On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <mikhail@synack.com> wrote:
>         >> Hi Joe,
>         >>
>         >> Thank you for your quick response. The system is currently
>         in the deadlock
>         >> state with 10 worker threads spinning. So, I'll gather the
>         info you
>         >> requested.
>         >>
>         >> - The available space on the partition is 223G free of 500G
>         (same as was
>         >> available for 0.6.1)
>         >> - java.arg.3=-Xmx4096m in bootstrap.conf
>         >> - thread dump and jstats are here
>         >>
>         https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
>         >>
>         >> Unfortunately, it's hard to predict when the decay starts
>         and it takes too
>         >> long to have to monitor the system manually. However, if
>         you still need,
>         >> after seeing the attached dumps, the thread dumps while it
>         decays I can set
>         >> up a timer script.
>         >>
>         >> Let me know if you need any more info.
>         >>
>         >> Thanks,
>         >> Mike.
>         >>
>         >>
>         >> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <joe.witt@gmail.com> wrote:
>         >>>
>         >>> Mike,
>         >>>
>         >>> Can you capture a series of thread dumps as the gradual
>         decay occurs
>         >>> and signal at what point they were generated specifically
>         calling out
>         >>> the "now the system is doing nothing" point.  Can you
>         check for space
>         >>> available on the system during these times as well.  Also,
>         please
>         >>> advise on the behavior of the heap/garbage collection. 
>         Often (not
>         >>> always) a gradual decay in performance can suggest an
>         issue with GC as
>         >>> you know.  Can you run something like
>         >>>
>         >>> jstat -gcutil -h5 <pid> 1000
>         >>>
>         >>> And capture those rules in these chunks as well.
>         >>>
>         >>> This would give us a pretty good picture of the health of
>         the system/
>         >>> and JVM around these times.  It is probably too much for
>         the mailing
>         >>> list for the info so feel free to create a JIRA for this
>         and put
>         >>> attachments there or link to gists in github/etc.
>         >>>
>         >>> Pretty confident we can get to the bottom of what you're
>         seeing quickly.
>         >>>
>         >>> Thanks
>         >>> Joe
>         >>>
>         >>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <mikhail@synack.com>
>         >>> wrote:
>         >>> > Hello,
>         >>> >
>         >>> > Recently, we've upgraded from 0.6.1 to 1.1.1 and at
>         first everything was
>         >>> > working well. However, a few hours later none of the
>         processors were
>         >>> > showing
>         >>> > any activity. Then, I tried restarting nifi which caused
>         some flowfiles
>         >>> > to
>         >>> > get corrupted evidenced by exceptions thrown in the
>         nifi-app.log,
>         >>> > however
>         >>> > the processors still continue to produce no activity.
>         Next, I stop the
>         >>> > service and delete all state (content_repository
>         database_repository
>         >>> > flowfile_repository provenance_repository work). Then
>         the processors
>         >>> > start
>         >>> > working for a few hours (maybe a day) until the deadlock
>         occurs again.
>         >>> >
>         >>> > So, this cycle continues where I have to periodically
>         reset the service
>         >>> > and
>         >>> > delete the state to get things moving. Obviously, that's
>         not great. I'll
>         >>> > note that the flow.xml file has been changed, as I
>         added/removed
>         >>> > processors,
>         >>> > by the new version of nifi but 95% of the flow
>         configuration is the same
>         >>> > as
>         >>> > before the upgrade. So, I'm wondering if there is a
>         configuration
>         >>> > setting
>         >>> > that causes these deadlocks.
>         >>> >
>         >>> > What I've been able to observe is that the deadlock is
>         "gradual" in that
>         >>> > my
>         >>> > flow usually takes about 4-5 threads to execute. The
>         deadlock causes the
>         >>> > worker threads to max out at the limit and I'm not even
>         able to stop any
>         >>> > processors or list queues. I also, have not seen this
>         behavior in a
>         >>> > fresh
>         >>> > install of Nifi where the flow.xml would start out empty.
>         >>> >
>         >>> > Can you give me some advise on what to do about this?
>         Would the problem
>         >>> > be
>         >>> > resolved if I manually rebuild the flow with the new
>         version of Nifi
>         >>> > (not
>         >>> > looking forward to that)?
>         >>> >
>         >>> > Much appreciated.
>         >>> >
>         >>> > Mike.
>         >>> >
>         >>
>         >>
>         >>
>
>
>
>


Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Mikhail Sosonkin <mi...@synack.com>.
It's been a while since I've used a profiler, but I'll give it a shot when
I get to a place with a faster internet link :)

On Fri, Feb 17, 2017 at 12:08 AM, Tony Kurc <tr...@gmail.com> wrote:

> Mike, also if what Joe asked with the backpressure is "not being applied",
> if you're good with a profiler, I think joe and I both gravitated to
> 0x00000006c533b770 being locked in at org.apache.nifi.provenance.
> PersistentProvenanceRepository.persistRecord(
> PersistentProvenanceRepository.java:757). It would be interesting to see
> if that section is taking longer over time.
>
> On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <jo...@gmail.com> wrote:
>
>> Mike
>>
>> One more thing...can you please grab a couple more thread dumps for us
>> with 5 to 10 mins between?
>>
>> I don't see a deadlock but do suspect either just crazy slow IO going
>> on or a possible livelock.  The thread dump will help narrow that down
>> a bit.
>>
>> Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
>> system too please.
>>
>> Thanks
>> Joe
>>
>> On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <jo...@gmail.com> wrote:
>> > Mike,
>> >
>> > No need for more info.  Heap/GC looks beautiful.
>> >
>> > The thread dump however, shows some problems.  The provenance
>> > repository is locked up.  Numerous threads are sitting here
>> >
>> > at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.
>> lock(ReentrantReadWriteLock.java:727)
>> > at org.apache.nifi.provenance.PersistentProvenanceRepository.
>> persistRecord(PersistentProvenanceRepository.java:757)
>> >
>> > This means these are processors committing their sessions and updating
>> > provenance but they're waiting on a readlock to provenance.  This lock
>> > cannot be obtained because a provenance maintenance thread is
>> > attempting to purge old events and cannot.
>> >
>> > I recall us having addressed this so am looking to see when that was
>> > addressed.  If provenance is not critical for you right now you can
>> > swap out the persistent implementation with the volatile provenance
>> > repository.  In nifi.properties change this line
>> >
>> > nifi.provenance.repository.implementation=org.apache.nifi.
>> provenance.PersistentProvenanceRepository
>> >
>> > to
>> >
>> > nifi.provenance.repository.implementation=org.apache.nifi.
>> provenance.VolatileProvenanceRepository
>> >
>> > The behavior reminds me of this issue which was fixed in 1.x
>> > https://issues.apache.org/jira/browse/NIFI-2395
>> >
>> > Need to dig into this more...
>> >
>> > Thanks
>> > Joe
>> >
>> > On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <mi...@synack.com>
>> wrote:
>> >> Hi Joe,
>> >>
>> >> Thank you for your quick response. The system is currently in the
>> deadlock
>> >> state with 10 worker threads spinning. So, I'll gather the info you
>> >> requested.
>> >>
>> >> - The available space on the partition is 223G free of 500G (same as
>> was
>> >> available for 0.6.1)
>> >> - java.arg.3=-Xmx4096m in bootstrap.conf
>> >> - thread dump and jstats are here
>> >> https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
>> >>
>> >> Unfortunately, it's hard to predict when the decay starts and it takes
>> too
>> >> long to have to monitor the system manually. However, if you still
>> need,
>> >> after seeing the attached dumps, the thread dumps while it decays I
>> can set
>> >> up a timer script.
>> >>
>> >> Let me know if you need any more info.
>> >>
>> >> Thanks,
>> >> Mike.
>> >>
>> >>
>> >> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <jo...@gmail.com> wrote:
>> >>>
>> >>> Mike,
>> >>>
>> >>> Can you capture a series of thread dumps as the gradual decay occurs
>> >>> and signal at what point they were generated specifically calling out
>> >>> the "now the system is doing nothing" point.  Can you check for space
>> >>> available on the system during these times as well.  Also, please
>> >>> advise on the behavior of the heap/garbage collection.  Often (not
>> >>> always) a gradual decay in performance can suggest an issue with GC as
>> >>> you know.  Can you run something like
>> >>>
>> >>> jstat -gcutil -h5 <pid> 1000
>> >>>
>> >>> And capture those rules in these chunks as well.
>> >>>
>> >>> This would give us a pretty good picture of the health of the system/
>> >>> and JVM around these times.  It is probably too much for the mailing
>> >>> list for the info so feel free to create a JIRA for this and put
>> >>> attachments there or link to gists in github/etc.
>> >>>
>> >>> Pretty confident we can get to the bottom of what you're seeing
>> quickly.
>> >>>
>> >>> Thanks
>> >>> Joe
>> >>>
>> >>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <mikhail@synack.com
>> >
>> >>> wrote:
>> >>> > Hello,
>> >>> >
>> >>> > Recently, we've upgraded from 0.6.1 to 1.1.1 and at first
>> everything was
>> >>> > working well. However, a few hours later none of the processors were
>> >>> > showing
>> >>> > any activity. Then, I tried restarting nifi which caused some
>> flowfiles
>> >>> > to
>> >>> > get corrupted evidenced by exceptions thrown in the nifi-app.log,
>> >>> > however
>> >>> > the processors still continue to produce no activity. Next, I stop
>> the
>> >>> > service and delete all state (content_repository database_repository
>> >>> > flowfile_repository provenance_repository work). Then the processors
>> >>> > start
>> >>> > working for a few hours (maybe a day) until the deadlock occurs
>> again.
>> >>> >
>> >>> > So, this cycle continues where I have to periodically reset the
>> service
>> >>> > and
>> >>> > delete the state to get things moving. Obviously, that's not great.
>> I'll
>> >>> > note that the flow.xml file has been changed, as I added/removed
>> >>> > processors,
>> >>> > by the new version of nifi but 95% of the flow configuration is the
>> same
>> >>> > as
>> >>> > before the upgrade. So, I'm wondering if there is a configuration
>> >>> > setting
>> >>> > that causes these deadlocks.
>> >>> >
>> >>> > What I've been able to observe is that the deadlock is "gradual" in
>> that
>> >>> > my
>> >>> > flow usually takes about 4-5 threads to execute. The deadlock
>> causes the
>> >>> > worker threads to max out at the limit and I'm not even able to
>> stop any
>> >>> > processors or list queues. I also, have not seen this behavior in a
>> >>> > fresh
>> >>> > install of Nifi where the flow.xml would start out empty.
>> >>> >
>> >>> > Can you give me some advise on what to do about this? Would the
>> problem
>> >>> > be
>> >>> > resolved if I manually rebuild the flow with the new version of Nifi
>> >>> > (not
>> >>> > looking forward to that)?
>> >>> >
>> >>> > Much appreciated.
>> >>> >
>> >>> > Mike.
>> >>> >
>> >>
>> >>
>> >>
>>
>
>


Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Tony Kurc <tr...@gmail.com>.
Mike, also: if the answer to what Joe asked about backpressure is "not being
applied", and you're good with a profiler, I think Joe and I both gravitated to
0x00000006c533b770 being locked in at
org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757).
It would be interesting to see if that section is taking longer over time.
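
A simple way to watch that over time without a full profiler is to capture
periodic thread dumps and count how many threads are sitting in persistRecord
(<pid> is the NiFi JVM pid, as in the jstat command quoted below; the 5-minute
interval is arbitrary):

    while true; do
      ts=$(date +%Y%m%d-%H%M%S)
      # full thread dump with lock information
      jstack -l <pid> > "threaddump-$ts.txt"
      # number of threads currently in the provenance write path
      grep -c 'PersistentProvenanceRepository.persistRecord' "threaddump-$ts.txt"
      sleep 300
    done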


Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Joe Witt <jo...@gmail.com>.
Cool thanks Mike.  Mount question/concern resolved.

Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Mikhail Sosonkin <mi...@synack.com>.
I'm really happy that you guys responded so well; it's quite lonely
googling for this stuff :)

Right now the volume is high because NiFi is catching up at about 2.5G/5m
and 500 FlowFiles/5m, but normally we're at about 100Mb/5m with a few
spikes here and there, nothing too intense.

We are using an EC2 instance with 32G RAM and a 500G SSD. All the work is
done on the same mount. Not sure what you mean by timestamps in this case.
Our setup is pretty close to out of the box, with only the heap size limit
changed in bootstrap.conf and a few Groovy-based processors.

I'll try to get you some thread dumps for the decay, though it might have
to wait until tomorrow or Monday. I want to see if I can get it to behave
like this on a test system.

Mike.

Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Joe Witt <jo...@gmail.com>.
Mike

Totally get it.  If you are able to get back into that state on this or
another system, we're highly interested in learning more.  In looking at
the code relevant to your stack trace I'm not quite seeing the trail
just yet.  The problem is definitely with the persistent provenance
repository.  Getting the phased thread dumps will help tell more of the
story.

Also, can you tell us anything about the volume/mount that the NiFi
install, and specifically the provenance repository, is on?  Any
interesting mount options involving timestamps, etc.?
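
For example, something along these lines would show them (the /opt/nifi
path below is just a placeholder for wherever your install and provenance
repository actually live):

# identify the backing filesystem and its mount options
df -h /opt/nifi
findmnt -T /opt/nifi -o SOURCE,FSTYPE,OPTIONS   # look for noatime/relatime in OPTIONS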

No rush of course, and glad you're back in business.  But you've
definitely got our attention :-)

Thanks
Joe

Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Mikhail Sosonkin <mi...@synack.com>.
Joe,

Many thanks for the pointer on the volatile provenance repository. It is,
indeed, more critical for us that the data moves. Before receiving this
message, I changed the config and restarted. The data started moving,
which is awesome!

I'm happy to help you debug this issue. Do you need these collections with
the volatile setting, or with the persistent setting in the locked state?

Mike.

Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Joe Witt <jo...@gmail.com>.
Mike

One more thing...can you please grab a couple more thread dumps for us,
with 5 to 10 minutes between them?

I don't see a deadlock, but I do suspect either just crazy slow IO going
on or a possible livelock.  The thread dumps will help narrow that down
a bit.

Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
system too, please?
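
If it's easier to script it, a rough loop along these lines would cover
both (the <pid> placeholder and the /var/tmp paths are just examples;
jstack ships with the JDK):

while true; do
  ts=$(date +%Y%m%d-%H%M%S)
  jstack <pid> > /var/tmp/nifi-threads-$ts.txt    # one thread dump per round
  iostat -xmh 20 3 > /var/tmp/iostat-$ts.txt      # three 20-second samples
  sleep 300                                       # wait ~5 minutes between rounds
done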

Thanks
Joe

Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Joe Witt <jo...@gmail.com>.
Mike,

No need for more info.  Heap/GC looks beautiful.

The thread dump, however, shows some problems.  The provenance
repository is locked up.  Numerous threads are sitting here:

at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)

This means these are processors committing their sessions and updating
provenance, but they're waiting on a read lock to provenance.  This lock
cannot be obtained because a provenance maintenance thread is
attempting to purge old events and cannot complete.

I recall us having addressed this, so I am looking to see when that was.
If provenance is not critical for you right now, you can swap out the
persistent implementation for the volatile provenance
repository.  In nifi.properties, change this line

nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository

to

nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
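
Roughly, from the NiFi install directory, something like this would apply
it (keep a backup of the original file; GNU sed assumed):

cp conf/nifi.properties conf/nifi.properties.bak
sed -i 's/PersistentProvenanceRepository/VolatileProvenanceRepository/' conf/nifi.properties
./bin/nifi.sh restart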

The behavior reminds me of this issue, which was fixed in 1.x:
https://issues.apache.org/jira/browse/NIFI-2395

Need to dig into this more...

Thanks
Joe

Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Mikhail Sosonkin <mi...@synack.com>.
Hi Joe,

Thank you for your quick response. The system is currently in the deadlock
state with 10 worker threads spinning, so I'll gather the info you
requested.

- The available space on the partition is 223G free of 500G (same as was
available for 0.6.1)
- java.arg.3=-Xmx4096m in bootstrap.conf
- thread dump and jstat output are here:
https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085

Unfortunately, it's hard to predict when the decay starts, and monitoring
the system manually takes too long. However, if you still need thread
dumps taken while it decays after seeing the attached ones, I can set up a
timer script.
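
Probably just a cron entry along these lines (paths are examples; the %
signs need escaping in a crontab):

*/5 * * * * /opt/nifi/bin/nifi.sh dump /var/tmp/nifi-threads-$(date +\%Y\%m\%d-\%H\%M).txt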

Let me know if you need any more info.

Thanks,
Mike.


Re: Deadlocks after upgrade from 0.6.1 to 1.1.1

Posted by Joe Witt <jo...@gmail.com>.
Mike,

Can you capture a series of thread dumps as the gradual decay occurs,
noting when each was generated and specifically calling out the "now the
system is doing nothing" point?  Can you check the space available on the
system during these times as well?  Also, please advise on the behavior
of the heap/garbage collection.  As you know, a gradual decay in
performance can often (though not always) suggest an issue with GC.  Can
you run something like

jstat -gcutil -h5 <pid> 1000

And capture those results in these chunks as well.
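
If it's easier, just leave it running in the background and redirect the
output to a log (same <pid> placeholder as above):

nohup jstat -gcutil -h5 <pid> 1000 > jstat-gcutil.log 2>&1 &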

This would give us a pretty good picture of the health of the system
and JVM around these times.  It is probably too much info for the mailing
list, so feel free to create a JIRA for this and put attachments there,
or link to gists on GitHub, etc.

Pretty confident we can get to the bottom of what you're seeing quickly.

Thanks
Joe
