Posted to dev@nifi.apache.org by Mark Bean <ma...@gmail.com> on 2022/06/10 17:16:27 UTC

possible load balancing issue

We have a situation where several FlowFiles have lost their content. They
still appear on the graph, but any attempt by a processor to access the
content results in a NullPointerException. The content claim file each
FlowFile points to is in fact missing from the file system.

Also, there are ERROR log messages indicating the claimant count is a
negative value.

o.a.n.c.r.c.StandardResourceClaimManager Decremented claimant count for
StandardResourceClaim[id=1234-567, container=default, section=890] to -1

(There are also some with negative values as low as -4.)
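
As a rough aid for triage, the sketch below scans log text for messages of the shape quoted above and reports the lowest claimant count seen per resource claim, along with where the claim file might be expected on disk. The regular expression is inferred only from the single log line in this message, and the <container dir>/<section>/<id> layout and the ./content_repository directory are assumptions to verify against the actual installation; the function names are hypothetical.

```python
import re
from pathlib import Path

# Pattern assumed from the one log line quoted in this thread; adjust it to
# the real log layout before relying on it. "contain\w*" tolerates variant
# spellings of the container field.
CLAIM_RE = re.compile(
    r"Decremented claimant count for\s+"
    r"StandardResourceClaim\[id=(?P<id>[^,\]]+),\s*"
    r"contain\w*=(?P<container>[^,\]]+),\s*"
    r"section=(?P<section>[^,\]]+)\]\s+to\s+(?P<count>-?\d+)"
)


def negative_claims(log_text):
    """Return {(container, section, id): lowest count seen} for counts below zero."""
    lowest = {}
    for m in CLAIM_RE.finditer(log_text):
        count = int(m.group("count"))
        if count < 0:
            key = (m.group("container"), m.group("section"), m.group("id"))
            lowest[key] = min(count, lowest.get(key, 0))
    return lowest


def expected_claim_path(container_dir, section, claim_id):
    """Assumed on-disk layout (hypothetical helper): <container_dir>/<section>/<claim_id>."""
    return Path(container_dir) / section / claim_id


sample = (
    "o.a.n.c.r.c.StandardResourceClaimManager Decremented claimant count for\n"
    "StandardResourceClaim[id=1234-567, container=default, section=890] to -1\n"
)
for (container, section, claim_id), count in negative_claims(sample).items():
    # "./content_repository" is a placeholder for the container's directory.
    path = expected_claim_path("./content_repository", section, claim_id)
    print(claim_id, count, path, "exists" if path.exists() else "missing")
```

Running this over the logs from each node would give a list of affected claims to compare against the FlowFiles that fail with the NullPointerException.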

We suspect, based only on anecdotal evidence, that this was caused by an
incomplete connection load balance. If so, it is not clear whether the
content successfully reached another node and the FlowFile simply didn't
finish cleaning up, or whether the content was dropped prematurely.
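
To make the concern concrete, here is a toy reference-counting model. It is purely illustrative: the class and method names are hypothetical and it does not reproduce NiFi's StandardResourceClaimManager. It only shows how releasing the same claim twice would drive a count to -1, and that the count already reaches zero (so the content looks unreferenced) after the first release.

```python
class ToyClaimCounter:
    """Toy reference counter; an illustration, not NiFi's implementation."""

    def __init__(self):
        self.counts = {}

    def increment(self, claim_id):
        self.counts[claim_id] = self.counts.get(claim_id, 0) + 1
        return self.counts[claim_id]

    def decrement(self, claim_id):
        self.counts[claim_id] = self.counts.get(claim_id, 0) - 1
        return self.counts[claim_id]

    def is_destructable(self, claim_id):
        # In this toy model, content may be deleted once nothing claims it.
        return self.counts.get(claim_id, 0) <= 0


counter = ToyClaimCounter()
counter.increment("1234-567")  # one FlowFile references the claim

# Hypothetical sequence: the claim is released once after a transfer that is
# believed complete, and a later cleanup pass releases the same claim again.
after_first = counter.decrement("1234-567")
after_second = counter.decrement("1234-567")

print(after_first, after_second, counter.is_destructable("1234-567"))
```

If something like this happened while a FlowFile on the sending node still pointed at the claim, the symptoms would match what we see: a negative count in the log and a missing file on disk. Whether NiFi can actually reach this state during a shutdown is exactly the open question.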

Note that the cluster was upgraded/restarted at about the time the errors
started. Could a shutdown of NiFi cause data loss if a load balance was in
progress?

NiFi 1.14.0

Thanks,
Mark

Re: possible load balancing issue

Posted by Mark Bean <ma...@gmail.com>.
Forgot this fun fact: while we're still on 1.14.0 on this particular NiFi
instance, it is a forked version which includes a cherry-pick of NIFI-9433.
So, it seems likely this is still a separate issue.

Again, we hope to reproduce this in a test environment next week. An
important step is to determine whether this is a load-balancing issue or
something else.


On Fri, Jun 10, 2022 at 3:13 PM Mark Bean <ma...@gmail.com> wrote:

> Yes, it will be a few weeks at least to get the upgrade into the
> environment where we see this occurring and evaluate. Part of the problem
> is reproducibility. We haven't yet created a scenario that reliably forces
> this situation. That's on the list for Monday though. If we can reliably
> reproduce, I'm sure we can test much sooner with 1.16.2 to confirm it's
> been addressed - even if we don't yet have 1.16.2 in the target environment.
>
> Will report back findings when available.
>
> Thanks,
> Mark
>
>
> On Fri, Jun 10, 2022 at 2:53 PM Joe Witt <jo...@gmail.com> wrote:
>
>> Mark
>>
>> It will be a few weeks before you can evaluate this?
>>
>> thanks
>>
>> On Fri, Jun 10, 2022 at 11:03 AM Joe Witt <jo...@gmail.com> wrote:
>>
>> > MarkB
>> >
>> > That is why MarkP said it was a manifestation. The point is that the
>> > issue you noted, specifically the behavior you saw here (and before), is
>> > believed to be addressed by that fix, which went into the release 6 months
>> > ago and is also in the 1.16.x line. You'll want that fix, and of course the
>> > many other improvements, to get better behavior in this scenario.
>> >
>> > Thanks
>> >
>> > On Fri, Jun 10, 2022 at 10:59 AM Mark Bean <ma...@gmail.com> wrote:
>> >
>> >> This is not quite the same issue. It's possible the fix for NIFI-9433 may
>> >> be related. But the set of circumstances is definitely different. Also,
>> >> the observed behavior is different. For example, none of the nodes report
>> >> "Cannot create negative queue size".
>> >>
>> >> I'm trying to track specific FlowFile(s) from one node to another during
>> >> load balancing, and I have been unsuccessful. In other words, I have not
>> >> been able to confirm whether a given FlowFile was successfully transferred
>> >> or not. Provenance is no longer available for this time period. I know,
>> >> these are not good answers for diagnosing the issue.
>> >>
>> >> My real question is: what is the expected behavior for FlowFiles that are
>> >> actively load balancing when the cluster is shut down?
>> >>
>> >> We have plans to upgrade as soon as possible, but unfortunately, that will
>> >> not be for at least a few more weeks due to the need to integrate custom
>> >> changes into 1.16.2.
>> >>
>> >>
>> >> On Fri, Jun 10, 2022 at 1:31 PM Mark Payne <ma...@hotmail.com> wrote:
>> >>
>> >> > Mark,
>> >> >
>> >> > This is a manifestation of NIFI-9433 [1] that we fixed a while back.
>> >> > Recommend you upgrade your installation.
>> >> >
>> >> > Thanks
>> >> > -Mark
>> >> >
>> >> >
>> >> > [1] https://issues.apache.org/jira/browse/NIFI-9433
>> >> >
>> >> >
>> >> > On Jun 10, 2022, at 1:16 PM, Mark Bean <ma...@gmail.com> wrote:
>> >> >
>> >> > We have a situation where several FlowFiles have lost their content.
>> >> > They still appear on the graph, but any attempt by a processor to
>> >> > access the content results in a NullPointerException. The content
>> >> > claim file each FlowFile points to is in fact missing from the file
>> >> > system.
>> >> >
>> >> > Also, there are ERROR log messages indicating the claimant count is a
>> >> > negative value.
>> >> >
>> >> > o.a.n.c.r.c.StandardResourceClaimManager Decremented claimant count for
>> >> > StandardResourceClaim[id=1234-567, container=default, section=890] to -1
>> >> >
>> >> > (There are also some with negative values as low as -4.)
>> >> >
>> >> > We suspect, based only on anecdotal evidence, that this was caused by
>> >> > an incomplete connection load balance. If so, it is not clear whether
>> >> > the content successfully reached another node and the FlowFile simply
>> >> > didn't finish cleaning up, or whether the content was dropped
>> >> > prematurely.
>> >> >
>> >> > Note that the cluster was upgraded/restarted at about the time the
>> >> > errors started. Could a shutdown of NiFi cause data loss if a load
>> >> > balance was in progress?
>> >> >
>> >> > NiFi 1.14.0
>> >> >
>> >> > Thanks,
>> >> > Mark

Re: possible load balancing issue

Posted by Mark Bean <ma...@gmail.com>.
Yes, it will be a few weeks at least to get the upgrade into the
environment where we see this occurring and evaluate. Part of the problem
is reproducibility. We haven't yet created a scenario that reliably forces
this situation. That's on the list for Monday though. If we can reliably
reproduce, I'm sure we can test much sooner with 1.16.2 to confirm it's
been addressed - even if we don't yet have 1.16.2 in the target environment.

Will report back findings when available.

Thanks,
Mark


Re: possible load balancing issue

Posted by Joe Witt <jo...@gmail.com>.
Mark

It will be a few weeks before you can evaluate this?

thanks

Re: possible load balancing issue

Posted by Joe Witt <jo...@gmail.com>.
MarkB

That is why MarkP said it was a manifestation. The point is that the issue
you noted, specifically the behavior you saw here (and before), is believed
to be addressed by that fix, which went into the release 6 months ago and is
also in the 1.16.x line. You'll want that fix, and of course the many other
improvements, to get better behavior in this scenario.

Thanks

Re: possible load balancing issue

Posted by Mark Bean <ma...@gmail.com>.
This is not quite the same issue. It's possible the fix for NIFI-9433 may
be related. But the set of circumstances is definitely different. Also,
the observed behavior is different. For example, none of the nodes report
"Cannot create negative queue size".

I'm trying to track specific FlowFile(s) from one node to another during
load balancing, and I have been unsuccessful. In other words, I have not
been able to confirm whether a given FlowFile was successfully transferred
or not. Provenance is no longer available for this time period. I know,
these are not good answers for diagnosing the issue.

My real question is: what is the expected behavior for FlowFiles that are
actively load balancing when the cluster is shut down?

We have plans to upgrade as soon as possible, but unfortunately, that will
not be for at least a few more weeks due to the need to integrate custom
changes into 1.16.2.


Re: possible load balancing issue

Posted by Mark Payne <ma...@hotmail.com>.
Mark,

This is a manifestation of NIFI-9433 [1] that we fixed a while back.
Recommend you upgrade your installation.

Thanks
-Mark


[1] https://issues.apache.org/jira/browse/NIFI-9433

