Posted to dev@nifi.apache.org by "Yuanzhe Yang (杨远哲)" <yy...@gmail.com> on 2016/04/25 17:21:14 UTC

Is NiFi processing streamingly?

Hi,

I have read some of the NiFi documentation, but I still don't have a clear picture of how data flows inside NiFi. Is data processed in a streaming fashion, or does each processor receive the entire intermediate result produced by the previous processor? Also, what is the granularity of clustering: is it at the dataflow level or at the processor level?

Thank you very much for the clarification; your work is much appreciated.

Regards,
Yang

Re: Is NiFi processing streamingly?

Posted by "Yuanzhe Yang (杨远哲)" <yy...@gmail.com>.

Hi Joe,

Thanks a lot for your detailed explanation. That’s clear now :D

Regards,
Yang

Re: Is NiFi processing streamingly?

Posted by Joe Percivall <jo...@yahoo.com.INVALID>.

Hello Yang,

For a cluster with a "Y" style dataflow, each node will run a copy of the whole flow. This means that at the merging point, only data on the same node will get merged.

A little bit of a metaphor: say you want to create toys that combine multiple different parts (the input data) and you have two workers (the nodes). NiFi would break up the work by giving each worker the blueprints (the dataflow) for the entire toy, and each worker would use their own raw materials to build their own end product (the final merged FlowFile). Raw materials from one worker are never merged with those of the other; each worker's materials are processed independently.

NiFi uses the same concept of isolating the work to independent workers. 

There is a little wiggle room: work can be redistributed across nodes using Site-to-Site (S2S), and processors can use the "primary node only" scheduling strategy, but those are special cases.
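
As a rough illustration of the "Y" answer above, here is a small Python sketch. It is not NiFi code: the function `y_flow`, the node names, and the sample data are all made up for the example. It only shows the idea that every node runs the whole flow, so the merge step on a node sees only the data that node received.

```python
from typing import Dict, List


def y_flow(branch_a: List[str], branch_b: List[str]) -> str:
    """The whole "Y" flow: two input branches joined at a merge step."""
    left = [item.upper() for item in branch_a]   # hypothetical branch A step
    right = [item[::-1] for item in branch_b]    # hypothetical branch B step
    return "+".join(left + right)                # merge point


# Hypothetical input: the data each node happened to receive.
node_inputs: Dict[str, Dict[str, List[str]]] = {
    "node-1": {"a": ["x1", "x2"], "b": ["ab"]},
    "node-2": {"a": ["y1"], "b": ["cd", "ef"]},
}

# Every node runs the same complete flow on its own data only.
merged_per_node = {
    node: y_flow(data["a"], data["b"]) for node, data in node_inputs.items()
}

for node, merged in merged_per_node.items():
    print(node, merged)
```

Neither node sits idle after the merge point, and nothing from node-1 ever ends up in node-2's merged result.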

Hope that metaphor helps a bit,
Joe

- - - - - - Joseph Percivall
linkedin.com/in/Percivall
e: joepercivall@yahoo.com

Re: Is NiFi processing streamingly?

Posted by "Yuanzhe Yang (杨远哲)" <yy...@gmail.com>.

Hi Joe,

Thanks a lot for your explanation and suggestion. As for the clustering question, what I actually meant to ask is this: if we have a two-node cluster and a "Y" style dataflow, will each node work on one of the two branches? If so, what happens after the results are merged at the processor where the branches meet? Does one node become idle?

Regards,
Yang

Re: Is NiFi processing streamingly?

Posted by Joe Percivall <jo...@yahoo.com.INVALID>.

Hello Yang,

To understand how data flows through NiFi to the processors, you need to understand FlowFiles. A FlowFile is the data record that processors work on: it consists of a pointer to content and a collection of attributes. So each time, a processor acts on the entire FlowFile produced by the previous processor.
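
To make the "pointer to content plus attributes" idea concrete, here is a minimal Python sketch. It is a simplified model and not NiFi's actual API: `FlowFile`, `CONTENT_REPO`, the claim ids, and the two example processors are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical stand-in for a content store: claim id -> bytes.
CONTENT_REPO: Dict[str, bytes] = {}


@dataclass
class FlowFile:
    """Simplified model: attributes plus a pointer (claim id) to content."""
    content_claim: str
    attributes: Dict[str, str] = field(default_factory=dict)


def create_flowfile(claim_id: str, content: bytes, **attributes: str) -> FlowFile:
    """Store the content once and hand back a FlowFile that points to it."""
    CONTENT_REPO[claim_id] = content
    return FlowFile(content_claim=claim_id, attributes=dict(attributes))


def update_attribute(ff: FlowFile, key: str, value: str) -> FlowFile:
    """Attribute-only step: the content pointer is reused, no bytes are copied."""
    return FlowFile(ff.content_claim, {**ff.attributes, key: value})


def to_upper(ff: FlowFile, new_claim: str) -> FlowFile:
    """Content-changing step: reads the whole content, writes a new claim."""
    CONTENT_REPO[new_claim] = CONTENT_REPO[ff.content_claim].upper()
    return FlowFile(new_claim, dict(ff.attributes))


ff1 = create_flowfile("claim-1", b"hello nifi", filename="greeting.txt")
ff2 = update_attribute(ff1, "source", "demo")   # same content pointer as ff1
ff3 = to_upper(ff2, "claim-2")                  # new content, same attributes
print(ff2, ff3, CONTENT_REPO)
```

In this model each step receives a complete FlowFile from the previous step, which matches the "acts on the entire FlowFile" description above.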

For clustering, the flow is replicated to each node of the cluster. This means each node in the cluster has a copy of the flow, which it uses to process all data sent to it (except for processors marked as "primary node" only, but that's a bit more advanced).
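
A toy Python sketch of that replication, under stated assumptions: the function names, node names, and records are invented, and the "primary node only" step is modelled as a source that produces data on one node only. Real NiFi scheduling is more involved than this.

```python
from typing import Dict, List


def primary_only_source() -> List[str]:
    """Hypothetical source step scheduled on the primary node only."""
    return ["listing-1"]


def transform(record: str) -> str:
    """A step that every node runs on its own data."""
    return record.strip().upper()


def run_flow_on_node(received: List[str], is_primary: bool) -> List[str]:
    """The same flow definition runs on every node; only the input differs."""
    queue = list(received)
    if is_primary:
        # Only the primary node runs the "primary node only" source.
        queue.extend(primary_only_source())
    return [transform(record) for record in queue]


# Hypothetical data sent to each node of a two-node cluster.
cluster_input: Dict[str, List[str]] = {"node-1": [" a ", "b"], "node-2": ["c "]}
primary = "node-1"

results = {
    node: run_flow_on_node(data, is_primary=(node == primary))
    for node, data in cluster_input.items()
}
print(results)
```

So the granularity in this picture is the whole dataflow per node, not one processor per node.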

For a better-worded, more in-depth look into NiFi, I would suggest checking out the PR for the "NiFi In Depth" doc [1]. It should help answer many questions you may have about the internals of NiFi. Any comments on it are much appreciated.

[1] https://github.com/apache/nifi/pull/339#discussion_r60103526

Joe

- - - - - - Joseph Percivall
linkedin.com/in/Percivall
e: joepercivall@yahoo.com
