Posted to dev@nifi.apache.org by "Peter Wicks (pwicks)" <pw...@micron.com> on 2018/06/07 17:55:19 UTC

Primary Only Content Migration

I'm sure many of you have the same situation: a flow that runs on a cluster and at some point merges back down to a primary-only processor, so your files sit there in the queue with nowhere to go... For a while we've used the workaround of a remote process group that loops the data back to the primary node, but we would really like a clean/simple solution. That approach requires that users be able to put an input port on the root flow and then route the file back down, which is a nuisance.

I have been thinking of adding either a processor that moves data between specific nodes in a cluster, or some kind of queue option that would let users migrate the content of a FlowFile back to the primary node. This would allow you to move data back to the primary very easily without needing RPGs and input ports at the root level.

All of my development work with NiFi has been focused on processors, so I'm not really sure where I would start with this.  Thoughts?

Thanks,
  Peter

Re: [EXT] Re: Primary Only Content Migration

Posted by Joe Witt <jo...@gmail.com>.
Certainly there are simpler approaches to take in the meantime for
this case.  Specifically, using ListenHTTP/PostHTTP with an
intentional configuration to do all merging of split items on a
specific node can do the trick.

For a more general-purpose and consumable solution it is worth noting:
- relying on 'primary node' depends on the primary node being the same
throughout the transfer of all parts.  That is of course not a safe
assumption, and it isn't in line with the intent of the primary node
construct at this stage.

- relying on a specific host to return to based on the original host
it came from would only work in cases where data is split on an origin
system, distributed, then merged back to a single node that is the
same host.  This doesn't help cases where data doesn't have a single
origin node and it suffers from the issue of availability of the
origin node during the merge phase.

A more general-purpose solution will ensure that data can be
recombined reliably on a given node, in such a way that all
'correlated' items make their way back together in time, and that we
distribute such merging across the cluster (think site-to-site with
node affinity on some hashed correlation value).

So yep - agreed - there are 'now options' with limitations, and there
is a good path that will be a lot of work but very useful and
reliable.
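
As a minimal illustration of that bucketing idea (plain Java, no NiFi
APIs; all names here are illustrative): hash the correlation value
into as many buckets as there are nodes, and route each item to the
node that owns its bucket.

    import java.util.List;

    // Illustrative sketch: pick a target node by hashing a correlation
    // value into one bucket per cluster node. All items sharing a
    // correlation value land on the same node -- as long as the node
    // list itself does not change mid-transfer.
    public class CorrelationPartitioner {

        public static String targetNode(String correlationValue, List<String> nodes) {
            // floorMod avoids a negative bucket when hashCode() is negative.
            int bucket = Math.floorMod(correlationValue.hashCode(), nodes.size());
            return nodes.get(bucket);
        }
    }

The catch is visible in the bucket count itself: if the node list
changes size, nearly every correlation value maps to a different
bucket, which is what the consistent-hashing discussion later in this
thread tries to address.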

Thanks
Joe

Re: [EXT] Re: Primary Only Content Migration

Posted by Brandon DeVries <br...@jhu.edu>.
In this particular case, the easiest thing to do may be to keep track of
the hostname of the original node of the file (before split and
distribution).  Then, just send all of the pieces back to the originator
node when done, and merge there.  It would be nice to have automatic queue
balancing of some sort, but that's going to be a major effort, and not
really necessary for Peter's issue.
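
A minimal sketch of that origin-stamping step (illustrative Java, not
a NiFi processor; the 'origin.hostname' attribute name is just an
example):

    import java.net.InetAddress;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch: stamp the origin hostname on each piece before
    // distribution, so the pieces can later be routed back to the node
    // they came from for the merge.
    public class OriginStamp {
        public static Map<String, String> stampOrigin(Map<String, String> attributes)
                throws Exception {
            Map<String, String> stamped = new HashMap<>(attributes);
            stamped.put("origin.hostname", InetAddress.getLocalHost().getHostName());
            return stamped;
        }
    }

The availability caveat Joe raises still applies: if the origin node
is down during the merge phase, the pieces have nowhere to go.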

Brandon

Re: [EXT] Re: Primary Only Content Migration

Posted by Bryan Bende <bb...@gmail.com>.
I'm also curious how Peter is currently handling this, since he
mentioned already sending data back to the primary node and we don't
yet have any of the functionality that Joe, Koji, and Pierre mentioned.
The only way I can think of doing that would be to use
PostHttp/InvokeHttp and point directly at the URL of the primary node,
although that would be problematic when the primary changes.

For Peter's scenario it sounds like he wants to run MergeContent in
defragment mode on the primary node after converging all the splits to
that node. In that case it may be challenging to handle the scenarios
Koji mentioned... if we are halfway through converging everything to
the primary node and then the primary node changes, do we
automatically move all of the data that was already sent to the old
primary over to the new primary so the defragment can continue
successfully? Maybe there is a special type of "merge" or "reduce"
component that can work across the cluster and handle this.
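
To make the failure mode concrete, here is a rough sketch of
defragment-style bookkeeping in plain Java (this is not NiFi's actual
MergeContent code, just the idea): fragments are collected per
identifier, and a group is only released once every expected fragment
has arrived.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch only: collect fragments by identifier; release the group
    // when it is complete. If the fragments for one identifier end up
    // split across two nodes, neither instance ever reaches the
    // expected count and the merge stalls.
    public class DefragmentBuffer {
        private final Map<String, List<String>> groups = new HashMap<>();

        /** Returns the completed group, or null while fragments are missing. */
        public List<String> offer(String fragmentId, int expectedCount, String content) {
            List<String> group = groups.computeIfAbsent(fragmentId, k -> new ArrayList<>());
            group.add(content);
            if (group.size() < expectedCount) {
                return null; // keep waiting -- possibly forever, on the wrong node
            }
            return groups.remove(fragmentId);
        }
    }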


Re: [EXT] Re: Primary Only Content Migration

Posted by Pierre Villard <pi...@gmail.com>.
Hi guys,

Koji is right, I initially filed NIFI-4026 to cover this kind of use case.

There are a lot of challenges here and many ways to address this
subject. Auto-balancing in queues would be a super nice way to move
toward this goal.

I think an easy first step would be to add a checkbox in the RPG
configuration allowing the user to send the data to the remote primary
node only. That information would be easy to add to the S2S peer
status data requested by clients. It would need proper documentation
for the cases where nodes disconnect/reconnect, but I think that's the
"easiest" improvement to achieve your goal here.

Since the changes are rather complex, I think we need to carefully think
about the solution design here.
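
As a purely hypothetical sketch of the peer selection that checkbox
implies (these types are illustrative, not the actual S2S client
classes): if the peer status reported by the cluster flagged which
node is currently primary, the client could simply filter its peer
list when the option is enabled.

    import java.util.List;

    // Hypothetical peer info; the real S2S peer status would need a
    // 'primary' flag added, as suggested above. (Java 16+ for records.)
    record PeerInfo(String hostname, boolean primary) {}

    class PeerSelector {
        static List<PeerInfo> eligiblePeers(List<PeerInfo> peers, boolean primaryOnly) {
            if (!primaryOnly) {
                return peers; // normal behavior: load-balance across all peers
            }
            // "Send to primary node only": keep just the flagged peer.
            return peers.stream().filter(PeerInfo::primary).toList();
        }
    }

If the primary changes mid-transfer, the next peer status refresh
would start directing new data to the new primary, which is why the
disconnect/reconnect behavior needs the proper documentation mentioned
above.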

Pierre

Re: [EXT] Re: Primary Only Content Migration

Posted by Koji Kawamura <ij...@gmail.com>.
There is an existing JIRA submitted by Pierre.
I think its goal is the same as what Joe mentioned above.
https://issues.apache.org/jira/browse/NIFI-4026

As for hashing and routing data with affinity/correlation, I think
'consistent hashing' is the most popular approach to minimize the
impact of node addition/removal.
Applying consistent hashing to the S2S client may not be difficult.
The challenging part is how to support a cluster topology change in
the middle of transferring data that needs correlation.

A simple challenging scenario:
Let's say there is a group of 4 FlowFiles with correlation id 'rel-A':
1. Client sends rel-A, data-1of4 to Node1
2. Client sends rel-A, data-2of4 to Node1
3. NodeN is added and takes over part of the hash key space that
Node1 was assigned
4. Client sends rel-A, data-3of4 to NodeN
5. Client sends rel-A, data-4of4 to NodeN

Then, a Merge processor running on Node1 or NodeN cannot complete,
because neither node has the whole dataset to merge.
This situation can be handled manually if we document it well.
Or we could add a resending loop, so that:

6. Client on Node1 resends rel-A, data-1of4 to NodeN
7. Client on Node1 resends rel-A, data-2of4 to NodeN
8. Merge processor on NodeN merges the FlowFiles.
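
A bare-bones consistent-hash ring sketch (plain Java, no NiFi
dependencies; the MD5 digest and virtual-point count are just
illustrative choices) shows why only the keys in the slice NodeN takes
over are affected:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Map;
    import java.util.TreeMap;

    // Nodes are placed on a ring at several virtual points; a key is
    // routed to the first node at or after its own hash. Adding NodeN
    // only reassigns keys that fall in the slice NodeN takes over.
    public class ConsistentHashRing {
        private static final int VIRTUAL_POINTS = 64;
        private final TreeMap<Long, String> ring = new TreeMap<>();

        public void addNode(String node) {
            // Several virtual points per node smooth out the distribution.
            for (int i = 0; i < VIRTUAL_POINTS; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }

        public String nodeFor(String key) {
            // First node at or after the key's hash, wrapping around
            // to the start of the ring if needed.
            Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
            return (e != null) ? e.getValue() : ring.firstEntry().getValue();
        }

        private static long hash(String s) {
            try {
                byte[] d = MessageDigest.getInstance("MD5")
                        .digest(s.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                for (int i = 0; i < 8; i++) {
                    h = (h << 8) | (d[i] & 0xFF); // first 8 digest bytes as a long
                }
                return h;
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
    }

Even with the ring, the 'rel-A' scenario above still needs the manual
handling or resending loop described here, because consistent hashing
only limits how much of the key space moves, not whether it moves.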

I'm interested in working on this improvement, too.

Thanks,
Koji

Re: [EXT] Re: Primary Only Content Migration

Posted by Joe Witt <jo...@gmail.com>.
Peter

I'm not sure there is a good way for a processor to drive such a thing
with existing infrastructure.  Giving a processor the ability to know
about the structure of the cluster is not something we have wanted to
expose, for good reasons.  There would likely need to be a more
fundamental point of support for this.

I'm not sure what that design would look like just yet - but I agree
this is an important step to take soon.  If you want to start
sketching out design ideas, that would be awesome.

Thanks

RE: [EXT] Re: Primary Only Content Migration

Posted by "Peter Wicks (pwicks)" <pw...@micron.com>.
Joe,

I agree it is a lot of work, which is why I was thinking of starting with a processor that could do some of these operations before looking further. If the processor could move FlowFiles between nodes in the cluster, it would be a good step. Data comes in from a queue on any node but gets written out to a queue on only the desired node, or gets output round-robin for a distribute scenario.

I want to work on it, and was trying to figure out whether it could be done using only a processor, or whether larger changes would definitely be needed.
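
For reference, pushing data with the existing site-to-site client
looks roughly like the sketch below (based on the documented
SiteToSiteClient usage; treat the details as an assumption, and the
URL/port name are placeholders). Note the gap: the client load-balances
across the cluster's peers, and there is no supported option to pin a
transfer to one specific node, which is the missing piece.

    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import org.apache.nifi.remote.Transaction;
    import org.apache.nifi.remote.TransferDirection;
    import org.apache.nifi.remote.client.SiteToSiteClient;

    public class SendViaSiteToSite {
        public static void main(String[] args) throws Exception {
            SiteToSiteClient client = new SiteToSiteClient.Builder()
                    .url("http://nifi-host:8080/nifi") // placeholder URL
                    .portName("From Cluster")          // placeholder input port
                    .build();
            Transaction txn = client.createTransaction(TransferDirection.SEND);
            txn.send("payload".getBytes(StandardCharsets.UTF_8),
                     Map.of("fragment.identifier", "rel-A")); // example attribute
            txn.confirm();  // verify the transfer with the remote side
            txn.complete(); // commit it
            client.close();
        }
    }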

--Peter

Re: [EXT] Re: Primary Only Content Migration

Posted by Joe Witt <jo...@gmail.com>.
Peter,

It isn't a pattern that is well supported now in a cluster context.

What is needed are automatically load-balanced connections with
partitioning.  This would mean a user could select a given
relationship and indicate that data should automatically be
distributed, and they should be able to express, optionally, a
correlation attribute that is used to ensure data which belongs
together stays together or comes back together.  We could use this to
automatically have a connection result in data being distributed
across the cluster for load-balancing purposes, and also to ensure
that data is brought back to a single node whenever necessary, which
is the case in certain scenarios like
fork/distribute/process/join/send and things like distributed receipt
then join for merging (like defragmenting data which has been split).
To join them together we need affinity/correlation, and this could
work based on some sort of hashing mechanism where there are as many
buckets as there are nodes in a cluster at a given time.  It needs a
lot of thought/design/testing/etc.

I was just having a conversation about this yesterday.  It is
definitely a thing and will be a major effort.  Will make a JIRA for
this soon.

Thanks

RE: [EXT] Re: Primary Only Content Migration

Posted by "Peter Wicks (pwicks)" <pw...@micron.com>.
Bryan,

We see this with large files that we have split up into smaller files and distributed across the cluster using site-to-site. We then want to merge them back together, so we send them to the primary node before continuing processing.

--Peter

Re: Primary Only Content Migration

Posted by Bryan Bende <bb...@gmail.com>.
Peter,

There really shouldn't be any non-source processors scheduled for
primary node only. We may even want to consider preventing that option
when the processor has an incoming connection to avoid creating any
confusion.

As long as you set source processors to primary node only, everything
should be ok... if the primary node changes, the source processor
starts executing on the new primary node, and any flow files it
already produced on the old primary node will continue to be worked
off by the downstream processors on the old node until they are all
processed.

-Bryan