Posted to users@nifi.apache.org by Igor Kravzov <ig...@gmail.com> on 2016/05/01 19:20:52 UTC

NiFi cluster question

If I understand correctly, in cluster mode the same dataflow runs on all the
nodes.
So let's say I have a simple dataflow with GetTwitter and PutHDFS
processors, and one NCM + 2 nodes.
Does that actually mean GetTwitter will be called independently, and
potentially simultaneously, on each node, so there may be duplicate results?
How about the PutHDFS processor?  Where should "hadoop configuration
resources" and "parent HDFS directory" point on each node?

Re: NiFi cluster question

Posted by Igor Kravzov <ig...@gmail.com>.
Thanks Joe.


Re: NiFi cluster question

Posted by Joe Witt <jo...@gmail.com>.
Igor,

There is no automatic failover of the node that is considered
primary.  For the upcoming 1.x release, though, this has been addressed:
https://issues.apache.org/jira/browse/NIFI-483

Thanks
Joe


Re: NiFi cluster question

Posted by Igor Kravzov <ig...@gmail.com>.
Thanks Aldrin for the response.
What I didn't fully understand from the documentation: is automatic fail-over
implemented? I would rather configure the entire workflow to run "On primary
node".


Re: NiFi cluster question

Posted by Aldrin Piri <al...@gmail.com>.
Igor,

Your thoughts are correct, and without any additional configuration, the
GetTwitter processor would run on both nodes.  The way to avoid this is to
select the "On primary node" scheduling strategy, which has the processor
run only on whichever node is currently primary.
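
If you would rather script that than click through the UI, here is a
minimal sketch against the REST API (the URL and processor id below are
placeholders; this assumes a 1.x-style /nifi-api endpoint, an unsecured
instance, and the Python requests library):

    import requests

    NIFI_API = "http://localhost:8080/nifi-api"        # hypothetical NiFi URL
    PROCESSOR_ID = "aaaabbbb-cccc-dddd-eeee-ffff0000"  # placeholder id

    # Fetch the processor entity first; an update must echo back the
    # current revision.
    entity = requests.get(NIFI_API + "/processors/" + PROCESSOR_ID).json()

    update = {
        "revision": entity["revision"],
        "component": {
            "id": PROCESSOR_ID,
            # "On primary node" in the UI corresponds to this strategy value.
            "config": {"schedulingStrategy": "PRIMARY_NODE_ONLY"},
        },
    }

    resp = requests.put(NIFI_API + "/processors/" + PROCESSOR_ID, json=update)
    resp.raise_for_status()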

PutHDFS has similar semantics, but in its case they are likely desired.
Consider the case where data is partitioned across the nodes: PutHDFS would
then need to run on each node to ensure all of the data is delivered to
HDFS.  The property you list controls where the data should land on the
configured HDFS instance.  Oftentimes this is set via Expression Language
(EL) to get the familiar time slicing of resources when persisted, such as
${now():format('yyyy/MM/dd/HH')}.  You could additionally have a directory
structure that mirrors the data, making use of attributes the files may have
gained as they made their way through your flow, or use an UpdateAttribute
to set an attribute, such as "hadoop.dest.dir", that the final PutHDFS
Directory property references to give a dynamic location on a per-FlowFile
basis.
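
To make the time slicing concrete: with a Directory property of, say,
/tweets/${now():format('yyyy/MM/dd/HH')} (the /tweets prefix is just an
example), a file written at 1:20 PM on May 1, 2016 would land under
/tweets/2016/05/01/13.  A rough Python equivalent of that EL expression,
only to illustrate the layout it produces:

    from datetime import datetime

    # Rough equivalent of the EL expression ${now():format('yyyy/MM/dd/HH')}.
    # Java's yyyy/MM/dd/HH corresponds to strftime's %Y/%m/%d/%H.
    time_slice = datetime.now().strftime("%Y/%m/%d/%H")

    # Example PutHDFS Directory value with an illustrative /tweets prefix.
    hdfs_directory = "/tweets/" + time_slice
    print(hdfs_directory)  # e.g. /tweets/2016/05/01/13

The attribute-based variant works the same way: set "hadoop.dest.dir"
upstream with UpdateAttribute and use ${hadoop.dest.dir} as the Directory
value.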

Let us know if you have additional questions or if things are unclear.

--aldrin


