Posted to users@nifi.apache.org by dan young <da...@gmail.com> on 2019/01/06 01:08:35 UTC

Re: flowfiles stuck in load balanced queue; nifi 1.8

Heya Mark,

Just an FYI: so far so good. After switching over to your recommendation we
haven't seen any "stuck" flowfiles.  Things are looking good so far, and we'll
look forward to 1.9.


Regards,

Dano

On Fri, Dec 28, 2018 at 8:43 AM Mark Payne <ma...@hotmail.com> wrote:

> Dan, et al,
>
> Great news! I was able to replicate this issue finally, by creating a
> Load-Balanced connection
> between two Process Groups/Ports instead of between two processors. The
> fact that it's between
> two Ports does not, in and of itself, matter. But there is a race
> condition, and Ports do no actual processing of the FlowFile (they simply
> pull it from one queue and transfer it to another). As a result, because
> this is extremely fast, it is more likely to trigger the race condition.
>
> So I created a JIRA [1] and have submitted a PR for it.
>
> While there is no fool-proof workaround until this fix is in and released,
> you could update your flow so that the connection between Process Groups is
> not load balanced, and instead load-balance the connection between the Input
> Port and the first Processor. Again, this is not fool-proof, because the race
> condition could still affect a Load-Balanced Connection even when it is
> connected to a Processor, but it is less likely to do so, so you would likely
> see the issue occur far less often.
>
> Thank you so much for sticking with us all as we diagnose this and figure
> it all out - we would not have been able to figure it out without you
> spending the time to debug the issue!
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-5919
>
>
> On Dec 26, 2018, at 10:31 PM, dan young <da...@gmail.com> wrote:
>
> Hello Mark,
>
> I just stopped the destination processor, and then disconnected the node
> in question (nifi1-1). Once I disconnected the node, the flow file in the
> load balanced connection disappeared from the queue.  After that, I
> reconnected the node (with the downstream processor disconnected) and once
> the node successfully rejoined the cluster, the flowfile showed up in the
> queue again. After this, I started the connected downstream processor, but
> the flowfile stays in the queue. The only way to clear the queue is if I
> actually restart the node.  If I disconnect the node, and then restart that
> node, the flowfile is no longer present in the queue.
>
> Regards,
>
> Dano
>
>
> On Wed, Dec 26, 2018 at 6:13 PM Mark Payne <ma...@hotmail.com> wrote:
>
>> Ok, I just wanted to confirm that when you said “once it rejoins the
>> cluster that flow file is gone” that you mean “the flowfile did not exist
>> on the system” and NOT “the queue size was 0 by the time that I looked at
>> the UI.” I.e., is it possible that the FlowFile did exist, was restored,
>> and then was processed before you looked at the UI? Or the FlowFile
>> definitely did not exist after the node was restarted? That’s why I was
>> suggesting that you restart with the connection’s source and destination
>> stopped. Just to make sure that the FlowFile didn’t just get processed
>> quickly on restart.
>>
>> Sent from my iPhone
>>
>> On Dec 26, 2018, at 7:55 PM, dan young <da...@gmail.com> wrote:
>>
>> Heya Mark,
>>
>> If we restart the node, that "stuck" flowfile will disappear. This is the
>> only way so far to clear out the flowfile. I usually disconnect the node,
>> then once it's disconnected I restart nifi, and then once it rejoins the
>> cluster that flow file is gone. If we try to empty the queue, it will just
>> say that there are no flow files in the queue.
>>
>>
>> On Wed, Dec 26, 2018, 5:22 PM Mark Payne <markap14@hotmail.com> wrote:
>>
>>> Hey Dan,
>>>
>>> Thanks, this is super useful! So, the following section is the damning
>>> part of the JSON:
>>>
>>>           {
>>>             "totalFlowFileCount": 1,
>>>             "totalByteCount": 975890,
>>>             "nodeIdentifier": "nifi1-1:9443",
>>>             "localQueuePartition": {
>>>               "totalFlowFileCount": 0,
>>>               "totalByteCount": 0,
>>>               "activeQueueFlowFileCount": 0,
>>>               "activeQueueByteCount": 0,
>>>               "swapFlowFileCount": 0,
>>>               "swapByteCount": 0,
>>>               "swapFiles": 0,
>>>               "inFlightFlowFileCount": 0,
>>>               "inFlightByteCount": 0,
>>>               "allActiveQueueFlowFilesPenalized": false,
>>>               "anyActiveQueueFlowFilesPenalized": false
>>>             },
>>>             "remoteQueuePartitions": [
>>>               {
>>>                 "totalFlowFileCount": 0,
>>>                 "totalByteCount": 0,
>>>                 "activeQueueFlowFileCount": 0,
>>>                 "activeQueueByteCount": 0,
>>>                 "swapFlowFileCount": 0,
>>>                 "swapByteCount": 0,
>>>                 "swapFiles": 0,
>>>                 "inFlightFlowFileCount": 0,
>>>                 "inFlightByteCount": 0,
>>>                 "nodeIdentifier": "nifi2-1:9443"
>>>               },
>>>               {
>>>                 "totalFlowFileCount": 0,
>>>                 "totalByteCount": 0,
>>>                 "activeQueueFlowFileCount": 0,
>>>                 "activeQueueByteCount": 0,
>>>                 "swapFlowFileCount": 0,
>>>                 "swapByteCount": 0,
>>>                 "swapFiles": 0,
>>>                 "inFlightFlowFileCount": 0,
>>>                 "inFlightByteCount": 0,
>>>                 "nodeIdentifier": "nifi3-1:9443"
>>>               }
>>>             ]
>>>           }
>>>
>>> It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890
>>> bytes. But it also shows that the FlowFile is not in the "local partition"
>>> or either of the two "remote partitions." So that leaves us with two
>>> possibilities:
>>>
>>> 1) The Queue's Count is wrong, because it somehow did not get
>>> decremented (perhaps a threading bug?)
>>>
>>> Or
>>>
>>> 2) The Count is correct and the FlowFile exists, but somehow the
>>> reference to the FlowFile was lost by the FlowFile Queue (again, perhaps a
>>> threading bug?)
>>>
>>> If possible, I would like for you to stop both the source and destination of
>>> that connection and then restart node nifi1-1. Once it has restarted, check
>>> if the FlowFile is still in the connection. That will tell us which of the
>>> two above scenarios is taking place. If the FlowFile exists upon restart,
>>> then the Queue somehow lost the handle to it. If the FlowFile does not
>>> exist in the connection upon restart (I'm guessing this will be the case),
>>> then it indicates that somehow the count is incorrect.
>>>
>>> Many thanks
>>> -Mark
>>>
>>> ------------------------------
>>> From: dan young <da...@gmail.com>
>>> Sent: Wednesday, December 26, 2018 9:18 AM
>>> To: NiFi Mailing List
>>> Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8
>>>
>>> Heya Mark,
>>>
>>> So I added a LogAttribute processor and routed the connection that had
>>> the "stuck" flowfile to it.  I ran a GET on the diagnostics endpoint for the
>>> LogAttribute processor before I started it, and then ran another diagnostics
>>> request after I started it.  The flowfile stayed in the load balanced
>>> connection/queue.  I've attached both files.  Please LMK if this helps.
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>>
>>> On Mon, Dec 24, 2018 at 10:35 AM Mark Payne <ma...@hotmail.com>
>>> wrote:
>>>
>>> Dan,
>>>
>>> You would want to get diagnostics for the processor that is the
>>> source/destination of the connection - not the FlowFile. But if your
>>> connection is connecting 2 process groups then both its source and
>>> destination are Ports, not Processors. So the easiest thing to do would be
>>> to drop a “dummy processor” into the flow between the 2 groups, drag the
>>> Connection to that processor, get diagnostics for the processor, and then
>>> drag it back to where it was. Does that make sense? Sorry for the hassle.
>>>
>>> Thanks
>>> -Mark
>>>
>>> Sent from my iPhone
>>>
>>> On Dec 24, 2018, at 11:40 AM, dan young <da...@gmail.com> wrote:
>>>
>>> Hello Bryan,
>>>
>>> Thank you, that was the ticket!
>>>
>>> Mark, I was able to run the diagnostics for a processor that's
>>> downstream from the connection where the flowfile appears to be "stuck".
>>> I'm not sure what processor is the source of this particular "stuck"
>>> flowfile since we have a number of upstream process groups (PG) that feed
>>> into a funnel.  This funnel is then connected to a downstream PG. It is
>>> this connection between the funnel and a downstream PG where the flowfile
>>> is stuck. I might reduce the upstream "load balanced connections" between
>>> the various PGs to just one so I can narrow where we need to run
>>> diagnostics....  If this isn't the correct processor to be gathering
>>> diagnostics for, please LMK where else I should look or what other
>>> diagnostics to run...
>>>
>>> I've also attached the output of the GET (nifi-api/connections/{id}) for
>>> that connection where the flowfile appears to be "stuck".
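>>>
>>> For reference, a minimal sketch of that GET (the host, port, and connection
>>> id are placeholders, not values from this thread; on a secured cluster add
>>> the Bearer-token header described in Bryan's message below):
>>>
>>>   curl -k -H "Authorization: Bearer $TOKEN" \
>>>     "https://<nifi-host>:<api-port>/nifi-api/connections/<connection-id>"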
>>>
>>> On Sun, Dec 23, 2018 at 8:36 PM Bryan Bende <bb...@gmail.com> wrote:
>>>
>>> You’ll need to get the token that was obtained when you logged in to the
>>> SSO and submit it on the curl requests the same way the UI is doing on all
>>> requests.
>>>
>>> You should be able to open the Chrome dev tools while in the UI, look at
>>> one of the request/responses, and copy the value of the 'Authorization'
>>> header, which should be in the form 'Bearer <token>'.
>>>
>>> Then send this on the curl command by specifying a header of -H
>>> 'Authorization: Bearer <token>'
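>>>
>>> For example, a minimal sketch (the hostname, API port, and processor UUID
>>> are placeholders, not values from this thread; -k is only needed if you
>>> want to skip TLS certificate verification):
>>>
>>>   # placeholder: paste the value copied from the Authorization header in dev tools
>>>   TOKEN='<token>'
>>>   curl -k -H "Authorization: Bearer $TOKEN" \
>>>     "https://<nifi-host>:<api-port>/nifi-api/processors/<processor-id>/diagnostics"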
>>>
>>> On Sun, Dec 23, 2018 at 6:28 PM dan young <da...@gmail.com> wrote:
>>>
>>> I forgot to mention that we're using the OpenID Connect SSO.  Is there
>>> a way to run these commands via curl when we have the cluster configured
>>> this way?  If so, would anyone be able to provide some insight/examples?
>>>
>>> Happy Holidays!
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>> On Sun, Dec 23, 2018 at 3:53 PM dan young <da...@gmail.com> wrote:
>>>
>>> This is what I'm seeing in the logs when I try to access
>>> the nifi-api/flow/about for example...
>>>
>>> 2018-12-23 22:51:45,579 INFO [NiFi Web Server-24201]
>>> o.a.n.w.s.NiFiAuthenticationFilter Authentication success for
>>> dan@looker.com
>>> 2018-12-23 22:52:01,375 INFO [NiFi Web Server-24136]
>>> o.a.n.w.a.c.AccessDeniedExceptionMapper identity[anonymous], groups[none]
>>> does not have permission to access the requested resource. Unknown user
>>> with identity 'anonymous'. Returning Unauthorized response.
>>>
>>> On Sun, Dec 23, 2018 at 3:50 PM dan young <da...@gmail.com> wrote:
>>>
>>> Hello Mark,
>>>
>>> I have a queue with a "stuck/phantom" flowfile again.  When I try
>>> to call nifi-api/processors/<processor-id>/diagnostics against a
>>> processor in the UI after I authenticate, I get an "Unknown user with
>>> identity 'anonymous'. Contact the system administrator." We're running a
>>> secure 3-node cluster. I tried this via the browser and also via the
>>> command line with curl on one of the nodes. One clarification point: what
>>> processor id should I be trying to gather the diagnostics on? The queue
>>> is in between two process groups.
>>>
>>> Maybe the issue with the Unknown User has to do with some policy I don't
>>> have set correctly?
>>>
>>> Happy Holidays!
>>>
>>> Regards,
>>> Dano
>>>
>>>
>>>
>>>
>>> On Wed, Dec 19, 2018 at 6:51 AM Mark Payne <ma...@hotmail.com> wrote:
>>>
>>> Hey Josef, Dano,
>>>
>>> Firstly, let me assure you that while I may be the only one from the
>>> NiFi side who's been engaging on debugging
>>> this, I am far from the only one who cares about it! :) This is a pretty
>>> big new feature that was added to the latest
>>> release, so understandably there are probably not yet a lot of people
>>> who understand the code well enough to
>>> debug. I have tried replicating the issue, but have not been successful.
>>> I have a 3-node cluster that ran for well over
>>> a month without a restart, and I've also tried restarting it every few
>>> hours for a couple of days. It has about 8 different
>>> load-balanced connections, with varying data sizes and volumes. I've not
>>> been able to get into this situation, though,
>>> unfortunately.
>>>
>>> But yes, I think that we've seen this issue arise from each of the two
>>> of you and one other on the mailing list, so it
>>> is certainly something that we need to nail down ASAP. Unfortunately,
>>> debugging an issue that involves communication
>>> between multiple nodes is often difficult to fully understand, so it may
>>> not be a trivial task to debug.
>>>
>>> Dano, if you are able to get to the diagnostics, as Josef mentioned,
>>> that is likely to be pretty helpful. Off the top of my head,
>>> there are a few possibilities that are coming to mind, as to what kind
>>> of bug could cause such behavior:
>>>
>>> 1) Perhaps there really is no flowfile in the queue, but we somehow
>>> miscalculated the size of the queue. The diagnostics
>>> info would tell us whether or not this is the case. It will look into
>>> the queues themselves to determine how many FlowFiles are
>>> destined for each node in the cluster, rather than just returning the
>>> pre-calculated count. Failing that, you could also stop the source
>>> and destination of the queue, restart the node, and then see if the
>>> FlowFile is entirely gone from the queue on restart, or if it remains
>>> in the queue. If it is gone, then that likely indicates that the
>>> pre-computed count is somehow off.
>>>
>>> 2) We are having trouble communicating with the node that we are trying
>>> to send the data to. I would expect some sort of ERROR
>>> log messages in this case.
>>>
>>> 3) The node is properly sending the FlowFile to where it needs to go,
>>> but for some reason the receiving node is then re-distributing it
>>> to another node in the cluster, which then re-distributes it again, so
>>> that it never ends up in the correct destination. I think this is unlikely
>>> and would be easy to verify by looking at the "Summary" table [1] and
>>> doing the "Cluster view" and constantly refreshing for a few seconds
>>> to see if the queue changes on any node in the cluster.
>>>
>>> 4) For some entirely unknown reason, there exists a bug that causes the
>>> node to simply see the FlowFile and just skip over it
>>> entirely.
>>>
>>> For additional logging, we can enable DEBUG logging on
>>> org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask:
>>>
>>> <logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask" level="DEBUG" />
>>>
>>> With that DEBUG logging turned on, it may or may not generate a lot of
>>> DEBUG logs. If it does not, then that in and of itself tells us something.
>>> If it does generate a lot of DEBUG logs, then it would be good to see
>>> what it's dumping out in the logs.
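>>>
>>> For example, a quick way to pull out just those entries on a node (assuming
>>> the default logs/nifi-app.log location under the NiFi install directory;
>>> adjust the path for your setup):
>>>
>>>   grep 'NioAsyncLoadBalanceClientTask' logs/nifi-app.log | tail -n 100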
>>>
>>> And a big Thank You to you guys for staying engaged on this and your
>>> willingness to dig in!
>>>
>>> Thanks!
>>> -Mark
>>>
>>> [1]
>>> https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page
>>>
>>>
>>> On Dec 19, 2018, at 2:18 AM, <Jo...@swisscom.com> <
>>> Josef.Zahner1@swisscom.com> wrote:
>>>
>>> Hi Dano
>>>
>>> Seems that the problem has been seen by a few people, but until now
>>> nobody from the NiFi team really cared about it – except Mark Payne. He
>>> mentioned the part below with the diagnostics, however in my case this
>>> doesn't even work (tried it on a standalone unsecured cluster as well as on
>>> a secured cluster)! Can you get the diagnostics on your cluster?
>>>
>>> I guess in the end we'll have to open a Jira ticket to narrow it down.
>>>
>>> Cheers Josef
>>>
>>>
>>> One thing that I would recommend, to get more information, is to go to
>>> the REST endpoint (in your browser is fine)
>>> /nifi-api/processors/<processor id>/diagnostics
>>>
>>> Where <processor id> is the UUID of either the source or the destination
>>> of the Connection in question. This gives us
>>> a lot of information about the internals of the Connection. The easiest way
>>> to get that Processor ID is to just click on the
>>> processor on the canvas and look at the Operate palette on the left-hand
>>> side. You can copy & paste from there. If you
>>> then send the diagnostics information to us, we can analyze that to help
>>> understand what's happening.
>>>
>>>
>>>
>>> From: dan young <da...@gmail.com>
>>> Reply-To: "users@nifi.apache.org" <us...@nifi.apache.org>
>>> Date: Wednesday, 19 December 2018 at 05:28
>>> To: NiFi Mailing List <us...@nifi.apache.org>
>>> Subject: flowfiles stuck in load balanced queue; nifi 1.8
>>>
>>> We're seeing this more frequently where flowfiles seem to be stuck in a
>>> load balanced queue.  The only resolution is to disconnect the node and
>>> then restart that node.  After this, the flowfile disappears from the
>>> queue.  Any ideas on what might be going on here or what additional
>>> information I might be able to provide to debug this?
>>>
>>> I've attached another thread dump and some screen shots....
>>>
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>>
>>> --
>>> Sent from Gmail Mobile
>>>
>>> <Screen Shot 2018-12-24 at 9.12.31 AM.png>
>>>
>>> <diag.json>
>>>
>>> <conn.json>
>>>
>>>
>

Re: flowfiles stuck in load balanced queue; nifi 1.8

Posted by Mark Payne <ma...@hotmail.com>.
Excellent! Thanks for the follow-up.

Sent from my iPhone
