Posted to commits@nifi.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2015/02/10 00:35:35 UTC
[jira] [Commented] (NIFI-337) Automated cluster load balancing

    [ https://issues.apache.org/jira/browse/NIFI-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313161#comment-14313161 ] 

Andrew Purtell commented on NIFI-337:
-------------------------------------

More:
{quote}
So the key will be how to devise and implement an algorithm or approach that spreads the load intelligently, so that data doesn't just bounce back and forth between nodes.  If anyone knows of good papers, similar systems, or approaches they can describe for how to think through this, that would be great.  Things we'll have to think about here that come to mind:

- When to start spreading the load (at what factor should we start spreading work across the cluster)

- Whether it should auto-spread by default and the user can tell it not to in certain cases or whether it should not spread by default and the user can activate it

- What the criteria are by which we should let a user control how data is partitioned (some key, round robin, etc.), and how to rebalance/re-assign partitions if a node dies or comes online

There are 'counter cases' too that we must keep in mind such as aggregation or bin packing grouped by some key.  In those cases all data would need to be merged together at some point and thus all data needs to be accessible at some point. Whether that means we direct all data to a single node or whether we enable cross-cluster data addressing is also a topic there.
{quote}
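
To make the partitioning question concrete, here is a minimal sketch of two of the criteria mentioned above: key-based assignment and round robin. It is not NiFi code and not a proposed design. The function names (`rendezvous_node`, `round_robin`) and the node names are hypothetical, and rendezvous (highest-random-weight) hashing is swapped in here as one known way to keep re-assignment small when a node dies.

```python
import hashlib
from itertools import cycle


def rendezvous_node(key, nodes):
    """Key-based partitioning: pick the node with the highest hash score.

    Hypothetical helper. Each (node, key) pair gets a deterministic score,
    and the key is owned by the top-scoring node.
    """
    def score(node):
        digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(nodes, key=score)


def round_robin(nodes):
    """Round-robin partitioning: hand out nodes in a repeating cycle."""
    return cycle(nodes)


if __name__ == "__main__":
    nodes = ["node-1", "node-2", "node-3"]  # assumed cluster membership
    keys = [f"flowfile-{i}" for i in range(1000)]

    before = {k: rendezvous_node(k, nodes) for k in keys}

    # Simulate node-2 dying and recompute ownership on the survivors.
    survivors = [n for n in nodes if n != "node-2"]
    after = {k: rendezvous_node(k, survivors) for k in keys}

    moved = [k for k in keys if before[k] != after[k]]
    # Only keys that were owned by the dead node change owner.
    print(len(moved), "keys re-assigned out of", len(keys))
```

With this scheme a key only moves when its owner leaves, which is one way to avoid data bouncing back and forth. Round robin spreads load evenly but gives no such stability for keyed data, which matters for the aggregation and bin-packing counter cases.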

> Automated cluster load balancing
> --------------------------------
>
>                 Key: NIFI-337
>                 URL: https://issues.apache.org/jira/browse/NIFI-337
>             Project: Apache NiFi
>          Issue Type: New Feature
>            Reporter: Andrew Purtell
>
> On dev@ in response to an inquiry, from [~joewitt]:
> {quote}
> The processors themselves are available and ready to run on all nodes at all times.  It's really just a question of whether they have data to run on.  We have always taken the view that if you want scalable dataflow, use scalable interfaces. And I think that is the way to go in every case you can pull it off. That generally meant one should use datasources which offer queueing semantics where multiple independent nodes can pull from the queue with 'at-least-once' guarantees.  In addition, each node has back pressure, so if it falls behind it slows its rate of pickup, which means other nodes in the cluster can pick up the slack.  This has worked extremely well.
> That said, I recognize that it isn't always possible to use scalable interfaces and given enough non-scalable datasources the cluster could become out of balance.  So this certainly seems like a good / valuable / fun / non-trivial problem to tackle.  If we allow connections between processors to be auto-balanced then it will make for a pretty smooth experience as users won't really have to think too much about it.
> {quote}
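
The pull-with-back-pressure behavior described in the quote above can be illustrated with a small simulation. This is a sketch under stated assumptions, not how NiFi implements it: the `Node` class, its `rate` and `threshold` parameters, and the tick-based `simulate` loop are all hypothetical.

```python
from collections import deque


class Node:
    """Hypothetical cluster node that pulls from a shared queue."""

    def __init__(self, name, rate, threshold):
        self.name = name
        self.rate = rate            # items processed per tick (assumed)
        self.threshold = threshold  # back-pressure limit on local backlog
        self.backlog = deque()
        self.done = 0

    def pull(self, source):
        # Back pressure: stop pulling once the local backlog is full.
        while source and len(self.backlog) < self.threshold:
            self.backlog.append(source.popleft())

    def work(self):
        # Process up to `rate` items from the local backlog.
        for _ in range(min(self.rate, len(self.backlog))):
            self.backlog.popleft()
            self.done += 1


def simulate(num_items, nodes):
    """Run ticks until the shared queue and all backlogs are drained."""
    source = deque(range(num_items))
    while source or any(n.backlog for n in nodes):
        for n in nodes:
            n.pull(source)
        for n in nodes:
            n.work()
    return {n.name: n.done for n in nodes}


if __name__ == "__main__":
    cluster = [Node("fast", rate=5, threshold=5),
               Node("slow", rate=1, threshold=5)]
    print(simulate(600, cluster))
```

Because the slow node stops pulling while its backlog is full, the faster node ends up taking most of the work from the shared queue without any central coordinator. The imbalance the issue is about arises when the source does not offer this kind of shared-queue interface.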



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)