Posted to dev@nifi.apache.org by Sivaprasanna <si...@gmail.com> on 2018/02/10 03:55:04 UTC

Implementation of ListFile's Primary Node only in a cluster

I was going through the ListFile processor's code and found that the
documentation
<https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ListFile.java#L72-L76>
says "this processor is designed to run on Primary Node only in a
cluster". I want to understand what "designed" means here. Does it mean
the processor is built so that it runs only on the Primary Node
regardless of how the "Execution Nodes" strategy is set, or does it mean
that the dataflow manager/developer is expected to set the 'Execution
Nodes' strategy to "Primary Node" at flow design time? If it is the
former, how is it handled in the code? If it is handled, it should be on
the framework side, but I don't see any annotation in the processor code
indicating such a mechanism. Moreover, a related JIRA, NIFI-543
<https://issues.apache.org/jira/browse/NIFI-543>, is still open, so I
want to clear up my doubt.

-
Sivaprasanna

Re: Implementation of ListFile's Primary Node only in a cluster

Posted by Sivaprasanna <si...@gmail.com>.

I think it was a cache issue; it works as intended now. It looks like
removing the executionNode === PRIMARY check from nf-processor-details.js
<https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-ui/src/main/webapp/js/nf/nf-processor-details.js#L220>
and nf-processor-configuration.js
<https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-ui/src/main/webapp/js/nf/canvas/nf-processor-configuration.js#L745>
alone is enough. However, I want to confirm with the community whether
it is okay to remove that check.

On Fri, Feb 23, 2018 at 10:28 PM, Sivaprasanna <si...@gmail.com>
wrote:

> I have started working on an annotation that a developer can use to
> indicate that a processor is supposed to run only on the 'Primary node'.
> The framework side of things works just fine. However, on the UI side
> there are a couple of questions and issues:
>
>    1. nf-processor-details.js
> <https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-ui/src/main/webapp/js/nf/nf-processor-details.js#L220>
> and nf-processor-configuration.js
> <https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-ui/src/main/webapp/js/nf/canvas/nf-processor-configuration.js#L745>
> check whether the setup 'isClustered' or 'executionNode === PRIMARY',
> which confuses me. Checking 'nfClusterSummary.isClustered()' alone is
> enough, right? Because we also check 'executionNode === PRIMARY', the
> 'execution-node-options' will be rendered for processors marked with
> this annotation even on a single-instance (i.e. non-clustered) NiFi.
>    2. To avoid this, I removed the 'executionNode === PRIMARY' condition
> check from the files mentioned above. Even after that,
> 'execution-node-options' is still being rendered. Am I missing
> something?
>
> I have pushed these changes to my remote repo. Here is the link:
> https://github.com/zenfenan/nifi/commit/e09e85960fb394eeef89d9cb6aa7acdfc5d4dad3
>
> BTW, right now I have implemented it this way: if the annotation is
> present, the executionNode is set to 'PRIMARY' when the processor is
> created/instantiated. However, this can be changed later by configuring
> the processor from the UI. Should we think about disabling the
> 'Execution Node' configuration altogether (in the UI) for a processor
> marked with this annotation? That makes more sense to me, but it seems
> to restrict the users' freedom to choose as they wish.

Re: Implementation of ListFile's Primary Node only in a cluster

Posted by Sivaprasanna <si...@gmail.com>.

I have started working on an annotation that a developer can use to
indicate that a processor is supposed to run only on the 'Primary node'.
The framework side of things works just fine. However, on the UI side
there are a couple of questions and issues:

   1. nf-processor-details.js
<https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-ui/src/main/webapp/js/nf/nf-processor-details.js#L220>
and nf-processor-configuration.js
<https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-ui/src/main/webapp/js/nf/canvas/nf-processor-configuration.js#L745>
check whether the setup 'isClustered' or 'executionNode === PRIMARY',
which confuses me. Checking 'nfClusterSummary.isClustered()' alone is
enough, right? Because we also check 'executionNode === PRIMARY', the
'execution-node-options' will be rendered for processors marked with
this annotation even on a single-instance (i.e. non-clustered) NiFi.
   2. To avoid this, I removed the 'executionNode === PRIMARY' condition
check from the files mentioned above. Even after that,
'execution-node-options' is still being rendered. Am I missing
something?
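
The rendering condition from item 1 can be reduced to a small truth
table. The sketch below is written in Java rather than the UI's
JavaScript, and the method names rendersWithOr and rendersClusteredOnly
are hypothetical stand-ins for the checks in the two files, not the
actual NiFi code.

```java
// Hypothetical model of the two rendering conditions discussed above.
class ExecutionNodeOptionsCheck {

    // Stand-in for the existing check: clustered OR executionNode is PRIMARY.
    static boolean rendersWithOr(boolean isClustered, String executionNode) {
        return isClustered || "PRIMARY".equals(executionNode);
    }

    // Stand-in for the proposed check: clustered only.
    static boolean rendersClusteredOnly(boolean isClustered) {
        return isClustered;
    }

    public static void main(String[] args) {
        // Print every combination so the differing case is easy to spot.
        for (boolean clustered : new boolean[] {false, true}) {
            for (String node : new String[] {"ALL", "PRIMARY"}) {
                System.out.printf(
                        "clustered=%b executionNode=%s -> current=%b clusteredOnly=%b%n",
                        clustered, node,
                        rendersWithOr(clustered, node),
                        rendersClusteredOnly(clustered));
            }
        }
    }
}
```

The two checks differ only for a non-clustered instance whose
executionNode is PRIMARY, which is the case a processor carrying the new
annotation would fall into.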

I have pushed these changes to my remote repo. Here is the link:
https://github.com/zenfenan/nifi/commit/e09e85960fb394eeef89d9cb6aa7acdfc5d4dad3

BTW, right now I have implemented it this way: if the annotation is
present, the executionNode is set to 'PRIMARY' when the processor is
created/instantiated. However, this can be changed later by configuring
the processor from the UI. Should we think about disabling the
'Execution Node' configuration altogether (in the UI) for a processor
marked with this annotation? That makes more sense to me, but it seems
to restrict the users' freedom to choose as they wish.
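
A minimal sketch of the creation-time default described above, assuming
a locally defined marker annotation. PrimaryNodeOnly, ExecutionNode,
ProcessorConfig and both processor classes are hypothetical names for
illustration and are not taken from the linked commit; the ALL default
for unannotated processors is also an assumption.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical marker annotation, defined locally for this sketch.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PrimaryNodeOnly {}

enum ExecutionNode { ALL, PRIMARY }

// Hypothetical processors: one marked, one not.
@PrimaryNodeOnly
class ListingProcessor {}

class PlainProcessor {}

// Hypothetical stand-in for the framework's per-processor configuration.
class ProcessorConfig {
    private ExecutionNode executionNode;

    // At creation time, default to PRIMARY when the annotation is present.
    ProcessorConfig(Class<?> processorClass) {
        this.executionNode = processorClass.isAnnotationPresent(PrimaryNodeOnly.class)
                ? ExecutionNode.PRIMARY
                : ExecutionNode.ALL; // assumed default for unannotated processors
    }

    ExecutionNode getExecutionNode() {
        return executionNode;
    }

    // Models a later edit from the UI; nothing here blocks the change.
    void setExecutionNode(ExecutionNode executionNode) {
        this.executionNode = executionNode;
    }
}

class CreationDefaultDemo {
    public static void main(String[] args) {
        ProcessorConfig config = new ProcessorConfig(ListingProcessor.class);
        System.out.println("at creation: " + config.getExecutionNode());
        config.setExecutionNode(ExecutionNode.ALL); // user overrides in the UI
        System.out.println("after UI edit: " + config.getExecutionNode());
    }
}
```

Because the setter is unrestricted, the annotation in this sketch only
supplies a default; it does not enforce anything, which is the trade-off
raised in the question above.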

On Sun, Feb 11, 2018 at 12:59 AM, Bryan Bende <bb...@gmail.com> wrote:

> Currently it means that the dataflow manager/developer is expected to
> set the 'Execution Nodes' strategy to "Primary Node" at the time of
> flow design.
>
> We don't have anything that restricts the scheduling strategy of a
> processor, but we should probably consider an annotation like
> @PrimaryNodeOnly that you can put on a processor so that the framework
> enforces that it can only be scheduled on the primary node.
>
> In the case of ListFile, I think the statement in the documentation is
> only partially true...
>
> When "Input Directory Location" is set to local, there should be no
> issue with scheduling the processor on all nodes in the cluster, as it
> would be listing a local directory and storing state locally.
>
> When "Input Directory Location" is set to remote, it wouldn't make
> sense to have all nodes listing the same remote directory and getting
> the same results. Also, the state is then stored in ZooKeeper under a
> ZNode keyed by the processor's UUID, and the processor has the same
> UUID on each node, so the nodes would be overwriting each other's
> state in ZK.
>
> So ListFile probably can't be restricted to primary node only, whereas
> something like ListHDFS probably could, because it is always listing a
> remote destination.

Re: Implementation of ListFile's Primary Node only in a cluster

Posted by Bryan Bende <bb...@gmail.com>.

Currently it means that the dataflow manager/developer is expected to
set the 'Execution Nodes' strategy to "Primary Node" at the time of
flow design.

We don't have anything that restricts the scheduling strategy of a
processor, but we should probably consider an annotation like
@PrimaryNodeOnly that you can put on a processor so that the framework
enforces that it can only be scheduled on the primary node.
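
As a rough illustration of what such enforcement could look like, here
is a sketch that rejects a scheduling request when a marked processor is
asked to run anywhere other than the primary node. The annotation is
defined locally, and SchedulingValidator, ExecutionNode and the two
processor classes are hypothetical names, not existing framework code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical marker annotation, defined locally for this sketch.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PrimaryNodeOnly {}

enum ExecutionNode { ALL, PRIMARY }

// Hypothetical processors: one that always lists a remote location, one that does not.
@PrimaryNodeOnly
class RemoteListingProcessor {}

class LocalProcessor {}

// Hypothetical framework-side check run before a processor is scheduled.
class SchedulingValidator {

    static void validate(Class<?> processorClass, ExecutionNode requested) {
        boolean primaryOnly = processorClass.isAnnotationPresent(PrimaryNodeOnly.class);
        if (primaryOnly && requested != ExecutionNode.PRIMARY) {
            throw new IllegalStateException(
                    processorClass.getSimpleName() + " may only be scheduled on the primary node");
        }
    }

    public static void main(String[] args) {
        validate(LocalProcessor.class, ExecutionNode.ALL);             // accepted
        validate(RemoteListingProcessor.class, ExecutionNode.PRIMARY); // accepted
        try {
            validate(RemoteListingProcessor.class, ExecutionNode.ALL); // rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```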

In the case of ListFile, I think the statement in the documentation is
only partially true...

When "Input Directory Location" is set to local, there should be no
issue with scheduling the processor on all nodes in the cluster, as it
would be listing a local directory and storing state locally.

When "Input Directory Location" is set to remote, it wouldn't make
sense to have all nodes listing the same remote directory and getting
the same results. Also, the state is then stored in ZooKeeper under a
ZNode keyed by the processor's UUID, and the processor has the same
UUID on each node, so the nodes would be overwriting each other's
state in ZK.
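
The overwrite problem can be sketched with plain maps. StateStore is a
hypothetical stand-in for either a node-local store or the shared
ZooKeeper-backed store; the UUID string and timestamps are made up for
illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key-value state store, keyed by processor UUID.
class StateStore {
    private final Map<String, Long> entries = new HashMap<>();

    void save(String processorUuid, long lastListedTimestamp) {
        entries.put(processorUuid, lastListedTimestamp);
    }

    Long load(String processorUuid) {
        return entries.get(processorUuid);
    }
}

class StateScopeDemo {
    public static void main(String[] args) {
        // The processor has the same UUID on every node (made-up value).
        String uuid = "hypothetical-processor-uuid";

        // Remote input: all nodes write to one shared store.
        StateStore shared = new StateStore();
        shared.save(uuid, 1000L); // written by node 1
        shared.save(uuid, 900L);  // written by node 2, replaces node 1's entry
        System.out.println("shared store holds: " + shared.load(uuid));

        // Local input: each node keeps its own store, so nothing collides.
        StateStore node1Local = new StateStore();
        StateStore node2Local = new StateStore();
        node1Local.save(uuid, 1000L);
        node2Local.save(uuid, 900L);
        System.out.println("node 1 local: " + node1Local.load(uuid));
        System.out.println("node 2 local: " + node2Local.load(uuid));
    }
}
```

With the shared store only the last writer's value survives, while the
per-node stores keep both values independently.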

So ListFile probably can't be restricted to primary node only, whereas
something like ListHDFS probably could, because it is always listing a
remote destination.
