Posted to users@nifi.apache.org by Mike Thomsen <mi...@gmail.com> on 2022/04/01 11:28:55 UTC

Re: Round robin load balancing eventually stops using all nodes

When we talk about "slower nodes" here, are we referring to nodes that
are the same size as the rest of the cluster but bogged down by data,
or are we talking about a heterogeneous cluster?

On Mon, Sep 27, 2021 at 12:07 PM Joe Witt <jo...@gmail.com> wrote:
>
> Ryan,
>
> Regarding NIFI-9236, the JIRA captures it well, but it sounds like
> there is now a better understanding of how it works and what options
> exist to better view details.
>
> Regarding Load Balancing: NIFI-7081 is largely about the scenario
> whereby, in load balancing cases, the slower nodes effectively set
> the rate the whole cluster can sustain, because we don't have a fluid
> load balancing strategy, which we should.  Such a strategy would allow
> the fastest nodes to always take the most data.  We just need to
> do that work.  No ETA.
>
> Thanks
>
> On Tue, Sep 21, 2021 at 2:18 PM Ryan Hendrickson
> <ry...@gmail.com> wrote:
> >
> > Joe - We're testing some scenarios.  Andrew captured some confusing behavior in the UI when enabling and disabling load balancing on a relationship: "Update UI for Clustered Connections" -- https://issues.apache.org/jira/projects/NIFI/issues/NIFI-9236
> >
> > Question - When a FlowFile is Load Balanced from one node to another, is the entire Content Claim load balanced?  Or just the small portion necessary?
> >
> > Mike -
> > We found two tickets that are in the ballpark:
> >
> > 1.  Improve handling of Load Balanced Connections when one node is slow   --    https://issues.apache.org/jira/browse/NIFI-7081
> > 2.  NiFi FlowFiles stuck in queue when using Single Node load balance strategy   --    https://issues.apache.org/jira/browse/NIFI-8970
> >
> > From @Simon's comment - we know we've seen underperforming nodes in a cluster before.  We're discussing whether @Simon's comment is applicable to the issue we're seeing:
> >           > "The one thing I can think of is the scenario where one (or more) nodes are significantly slower than the other ones. In these cases it might happen that the nodes that are “running behind” block the other nodes from a balancing perspective."
> >
> > @Simon - I'd like to understand the "blocks other nodes from balancing perspective" better if you have additional information.  We're trying to replicate this scenario.
> >
> > Thanks,
> > Ryan
> >
> > On Sat, Sep 18, 2021 at 3:45 PM Mike Thomsen <mi...@gmail.com> wrote:
> >>
> >> > there is a ticket to overcome this (there is no ETA),
> >>
> >> Do you know what the Jira # is?
> >>
> >> On Mon, Sep 6, 2021 at 7:14 AM Simon Bence <si...@gmail.com> wrote:
> >> >
> >> > Hi Mike,
> >> >
> >> > I did a quick check on the round robin balancing and, based on what I found, the reason for the issue must lie somewhere else, not directly within it. The one thing I can think of is the scenario where one (or more) nodes are significantly slower than the other ones. In these cases it might happen that the nodes that are “running behind” block the other nodes from a balancing perspective.
> >> >
> >> > Based on what you wrote, this is a possible reason, and there is a ticket to overcome this (there is no ETA), but other details might shed light on a different root cause.
> >> >
> >> > Regards,
> >> > Bence
> >> >
> >> >
> >> >
> >> > > On 2021. Sep 3., at 14:13, Mike Thomsen <mi...@gmail.com> wrote:
> >> > >
> >> > > We have a 5-node cluster, and sometimes I've noticed that round robin
> >> > > load balancing stops sending flowfiles to two of them, and sometimes
> >> > > toward the end of the data processing it can get as low as a single node.
> >> > > Has anyone seen similar behavior?
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Mike
> >> >
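The effect described for NIFI-7081 in the quoted message above (slower nodes setting the rate for the whole cluster) can be illustrated with a toy model. This is only a sketch under stated assumptions, not NiFi's implementation: the function `simulate`, the per-node `capacity` standing in for back pressure, and the rate numbers are all hypothetical. It compares a sender that strictly waits for the next node in the rotation with a "fluid" sender that gives work to any node that has room.

```python
def simulate(rates, capacity, ticks, strict):
    """Return how many items each node has processed after `ticks` steps.

    rates:    items each node can process per tick (hypothetical numbers)
    capacity: maximum queued items per node (a stand-in for back pressure)
    strict:   True  -> sender waits when the next node in rotation is full
              False -> sender tops up every node that has room ("fluid")
    """
    n = len(rates)
    queued = [0] * n
    done = [0] * n
    nxt = 0  # next node in the round robin rotation
    for _ in range(ticks):
        # Sender phase: the supply of items is assumed to be unlimited.
        if strict:
            # Hand out one item at a time in rotation, and stop as soon
            # as the node whose turn it is has a full queue.
            while queued[nxt] < capacity:
                queued[nxt] += 1
                nxt = (nxt + 1) % n
        else:
            # Fill every queue up to its capacity, ignoring rotation order.
            for i in range(n):
                queued[i] = capacity
        # Worker phase: each node processes up to its own rate.
        for i in range(n):
            work = min(rates[i], queued[i])
            queued[i] -= work
            done[i] += work
    return done


# Four fast nodes and one slow node (made-up rates).
rates = [4, 4, 4, 4, 1]
strict_done = simulate(rates, capacity=8, ticks=100, strict=True)
fluid_done = simulate(rates, capacity=8, ticks=100, strict=False)
print("strict round robin:", strict_done, "total", sum(strict_done))
print("fluid:             ", fluid_done, "total", sum(fluid_done))
```

In this model the strict sender ends up feeding every node at roughly the slow node's pace once the slow node's queue fills, while the fluid sender lets each node run at its own rate. Whether NiFi's round robin behaves this way in a given version is exactly what the thread is trying to establish.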

Re: Round robin load balancing eventually stops using all nodes

Posted by Mike Thomsen <mi...@gmail.com>.
I think I figured out how to get around this: partition-by-attribute
using UUID. About 10 minutes ago, I was down to 3/5 nodes on my
cluster. Switched the queues to that strategy, and the 3 full nodes
started sending work to the other two nodes without a restart.
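
As a rough illustration of why partitioning on a unique attribute spreads work across all nodes: the sketch below hashes a random UUID per flowfile and takes the result modulo the node count. It is a minimal model, not NiFi's code; the helper names `pick_node` and `distribute` and the use of CRC32 as the hash are assumptions made for the example, since the thread does not say how NiFi maps attribute values to nodes.

```python
import uuid
import zlib
from collections import Counter


def pick_node(attribute_value: str, node_count: int) -> int:
    """Map an attribute value to a node index with a stable hash.

    CRC32 is an arbitrary choice for this sketch; the same value always
    maps to the same node index.
    """
    return zlib.crc32(attribute_value.encode("utf-8")) % node_count


def distribute(flowfile_count: int, node_count: int) -> Counter:
    """Count how many flowfiles land on each node when the partitioning
    attribute is a freshly generated UUID per flowfile."""
    counts = Counter()
    for _ in range(flowfile_count):
        counts[pick_node(str(uuid.uuid4()), node_count)] += 1
    return counts


counts = distribute(10_000, 5)
for node in sorted(counts):
    print(f"node {node}: {counts[node]} flowfiles")
```

Because every UUID is different, the destination is chosen per flowfile from the hash alone rather than from a rotation, so in this model the counts per node come out close to even. That is consistent with the two idle nodes receiving work again, though it does not explain why round robin stopped using them.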

On Fri, Apr 1, 2022 at 7:44 AM Mike Thomsen <mi...@gmail.com> wrote:
>
> I think I forgot to mention early on that we're using embedded
> ZooKeeper. Could that be a factor in this behavior?
>
> Thanks,
>
> Mike
>

Re: Round robin load balancing eventually stops using all nodes

Posted by Mike Thomsen <mi...@gmail.com>.
I think I forgot to mention early on that we're using embedded
ZooKeeper. Could that be a factor in this behavior?

Thanks,

Mike
