Posted to users@nifi.apache.org by David Klim <da...@hotmail.com> on 2015/04/28 20:06:10 UTC

New to NiFi and interested on clustering capabilities

Hello,
I just joined the list. I am evaluating NiFi for a large project to see if it would fit as the main data collector. So far I am quite impressed with its capabilities; the concept is just great!
The project I am working on would require retrieving several hundreds of millions of files per day (hundreds of TB per day) so my first question is how to achieve distribution/clustering with NiFi, if that's possible.
Thanks in advance!

Re: New to NiFi and interested on clustering capabilities

Posted by Joe Witt <jo...@gmail.com>.
David,

Glad you are liking it!  As you explore and learn more feel free to
fire away with feedback on things you find confusing, too limiting, or
awesome (not necessarily in that order).

Each node in a cluster should be able to handle hundreds of thousands of
operations per second on the data and several hundred MB/s of
read/write throughput.  So the scale you're suggesting is feasible.
Of course, at those numbers the details do tend to matter, so if you
have questions just fire away.
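
To put rough numbers on that, here is a back-of-the-envelope sketch. The
200 TB/day volume and the 300 MB/s per-node rate are illustrative
assumptions standing in for "hundreds of TB per day" and "several hundred
MB/s", and the function names are made up for this example. It only
computes averages, so real sizing would need headroom for peaks and for
per-file overhead.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def required_throughput_mb_s(tb_per_day: float) -> float:
    """Average sustained rate in MB/s needed to move tb_per_day (decimal TB)."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


def nodes_needed(tb_per_day: float, node_mb_s: float, headroom: float = 1.0) -> int:
    """Nodes required if each node sustains node_mb_s; headroom > 1 adds slack."""
    return math.ceil(required_throughput_mb_s(tb_per_day) * headroom / node_mb_s)


if __name__ == "__main__":
    # Assumed inputs, not measurements: 200 TB/day and 300 MB/s per node.
    print(round(required_throughput_mb_s(200), 1), "MB/s on average")
    print(nodes_needed(200, 300), "nodes with no headroom")
    print(nodes_needed(200, 300, headroom=2.0), "nodes with 2x headroom")
```

Under those assumptions the average rate comes out a little above 2.3 GB/s,
which is why a cluster of several nodes, rather than a single instance,
is the natural fit.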

Thanks
Joe

Re: New to NiFi and interested on clustering capabilities

Posted by Aldrin Piri <al...@gmail.com>.
David,

As to your first question, the way files are distributed in the cluster is
primarily determined by their point of ingress.  More specifically, if a
file is pulled via a given processor on one of your nodes, the data
continues its full journey through the configured data flow on that same
node.  Given the context of your second inquiry, I'd imagine a possible
pain point would be having a central point introducing all the files to
your system.  One way to get better distribution is a fan-out approach.
Use your single, isolated processor to introduce data to the system and
then add a DistributeLoad processor to route 1/n of your data to each of
the systems in your n-node cluster.  Sending data to the remote systems
could be accomplished with n-1 PostHTTP processors, each configured to
point to one of the other instances, where the data is received by a
ListenHTTP.  The "local" system would be able to bypass the network
send/receive and just go straight into the flow.  This is not ideal for
what I perceive to be your case, and there is an open issue requesting a
feature to support such balancing automatically [2].  Any additional
thoughts you have on the issue would be appreciated.
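
To make the 1/n split concrete, here is a toy model of the fan-out. It is
not how DistributeLoad is implemented; it is only a sketch of a round-robin
assignment, and the node names and file names are hypothetical.

```python
from itertools import cycle


def distribute_round_robin(items, nodes):
    """Hand each item to the next node in turn, so each node gets about 1/n."""
    assignment = {node: [] for node in nodes}
    # cycle() repeats the node list forever; zip() stops when items run out.
    for item, node in zip(items, cycle(nodes)):
        assignment[node].append(item)
    return assignment


if __name__ == "__main__":
    # "local" stands for the node that ingests the data; the other two would
    # be reached over the network (PostHTTP to ListenHTTP in the text above).
    nodes = ["local", "node-2", "node-3"]
    files = [f"file-{i}" for i in range(10)]
    for node, assigned in distribute_round_robin(files, nodes).items():
        print(node, len(assigned))
```

Note that the ingesting node still touches every file once, which is the
central pain point mentioned above.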


The isolation mode you are looking for is available in a clustered flow via
the processor configuration.  Select the scheduling tab, and for scheduling
strategy, select "On primary node."  Just to call it out explicitly,
this option will only show up in a clustered flow.

There is at least one ticket related to this subject, NIFI-401 [1].  If
this particular method doesn't quite meet your use case, we would
definitely like to hear suggestions or opinions on how to make this
better.

[1] https://issues.apache.org/jira/browse/NIFI-401
[2] https://issues.apache.org/jira/browse/NIFI-337

On Wed, Apr 29, 2015 at 1:49 PM, David Klim <da...@hotmail.com> wrote:

> Thank you all for the information :)
>
> There is some detail I am missing which is how a defined flow gets
> partitioned across the nodes in the cluster. The now updated doc states "the
> same dataflow runs on all the nodes. As a result, every component in the
> flow runs on every node". How the files are partitioned to be collected
> by different nodes is relevant for the solution I am working on (at least
> could have implications on the definition of the dataflow itself) so I
> would like to dig in here.
>
> The doc also says "the DFM could configure the GetSFTP on the Primary
> Node to run in isolation, meaning that it only runs on that node".  I was
> trying to find this "isolation" configuration but no luck. Any hints? :-)
>
> Thanks again!
>
>

RE: New to NiFi and interested on clustering capabilities

Posted by David Klim <da...@hotmail.com>.
Thank you all for the information :)

There is one detail I am missing, which is how a defined flow gets partitioned across the nodes in the cluster. The now-updated doc states "the same dataflow runs on all the nodes. As a result, every component in the flow runs on every node". How the files are partitioned to be collected by different nodes is relevant for the solution I am working on (at the least it could have implications for the definition of the dataflow itself), so I would like to dig in here.

The doc also says "the DFM could configure the GetSFTP on the Primary Node to run in isolation, meaning that it only runs on that node".  I was trying to find this "isolation" configuration but had no luck. Any hints? :-)

Thanks again!

Re: New to NiFi and interested on clustering capabilities

Posted by Matt Gilman <ma...@gmail.com>.
Anup,

That section is still incomplete, unfortunately. We are definitely pushing
the documentation at the moment. Personally, I am working through getting
our REST endpoints documented. I know another committer has been working on
the contribution guide as well as some introductory NiFi quick start
guides. I can provide some quick points here in the meantime.

In the section for web properties you'll want to configure the 'https'
properties instead of the 'http' properties.

nifi.web.http.host=
nifi.web.http.port=
nifi.web.https.host=
nifi.web.https.port=

Then, further down, you'll need to configure the security properties.

nifi.security.keystore=
nifi.security.keystoreType=
nifi.security.keystorePasswd=
nifi.security.keyPasswd=
nifi.security.truststore=
nifi.security.truststoreType=
nifi.security.truststorePasswd=
nifi.security.needClientAuth=

These define the certificates that are used by the web server (and for
cluster and site-to-site communications). You will need to configure all
the keystore and truststore properties (if keyPasswd is not configured,
the keystorePasswd will be tried as the keyPasswd). If you set
needClientAuth to false, clients will be required to trust the certificate
in the keystore configured here. User access will still be anonymous,
however. If you set needClientAuth to true, clients will need to have
certificates loaded in their browser that are trusted by the truststore
configured here. User access will be determined using the DN from their
certificate and the authorization provider.
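
As a quick sanity check before restarting, something like the sketch below
could flag settings that were left blank. It is not part of NiFi. The
property names come from the lists above, but which ones to treat as
required is an assumption based on this message (keyPasswd is left out
because keystorePasswd is tried in its place), and the sample values are
hypothetical.

```python
# Keys assumed to be mandatory for an HTTPS setup; adjust to your needs.
REQUIRED_KEYS = [
    "nifi.web.https.port",
    "nifi.security.keystore",
    "nifi.security.keystoreType",
    "nifi.security.keystorePasswd",
    "nifi.security.truststore",
    "nifi.security.truststoreType",
    "nifi.security.truststorePasswd",
]


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_security_settings(props):
    """Return the required keys that are absent or set to an empty value."""
    return [key for key in REQUIRED_KEYS if not props.get(key)]


if __name__ == "__main__":
    # Hypothetical fragment: port and keystore path set, everything else blank.
    sample = """
    nifi.web.https.port=8443
    nifi.security.keystore=/path/to/keystore.jks
    nifi.security.keystoreType=
    """
    for key in missing_security_settings(parse_properties(sample)):
        print("not set:", key)
```

This only handles plain key=value lines, which is all the fragments above
use; it does not attempt the full Java properties syntax.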

NiFi supports pluggable authorization, which is only necessary if
needClientAuth is set to true. By default it's configured with a file-based
solution.

nifi.security.user.authority.provider=file-provider

Details on setting up this file and controlling the level of access have
started being discussed here [1].

Hope this helps while we get more detailed documentation written up. Thanks.

Matt

[1] https://nifi.incubator.apache.org/docs/nifi-docs/administration-guide.html#controlling-levels-of-access

Re: New to NiFi and interested on clustering capabilities

Posted by "Sethuram, Anup" <an...@philips.com>.
Hi David,
Is the “Security Configuration” added in the latest admin guide?

Regards,
anup

________________________________
The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.

Re: New to NiFi and interested on clustering capabilities

Posted by Matt Gilman <ma...@gmail.com>.
David,

Welcome and thanks for expressing interest in Apache NiFi. I just noticed
that the administrator guide [1] on our website [2] was not in its current
form, so I just uploaded the latest version. The document now includes a
quick explanation of our clustering capabilities and example configurations.
This would be a great place to start and become familiar with NiFi
clustering. Please let us know if you have any follow-up questions.

Also, if you had already viewed the administrator guide your browser may
have cached the older version so you may need to do a hard reload.

[1]
https://nifi.incubator.apache.org/docs/nifi-docs/administration-guide.html
[2] https://nifi.incubator.apache.org/
