You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@nifi.apache.org by Manish Gupta 8 <mg...@sapient.com> on 2016/08/08 11:55:56 UTC

Processors in cluster mode

Hi,

I am running a multi-node NiFi (0.7.0) cluster and trying to implement a streaming ingestion pipeline (@ 200 MB/s at peak and around 30 MB/s at non-peak hours) and routing to different destinations (Kafka, Azure Storage, HDFS). The dataflow will be exposing a TCP port for incoming data and will also be ingesting files from folder, database records etc.

It would be great if someone can provide a link/doc that explains how processors can be expected to behave in a multi-node environment.
My doubts are about how some of the processors work in a clustered mode, and the meaning of concurrent tasks.

For example:


*         ListenTCP:

o   When this processor is scheduled to run on a cluster (and not on the primary node), then does it mean I need to send data to all the individual nodes manually i.e. specify each node's host names separately? If I don't specify hosts individually and only provide let's say primary node's host name from producer, will all the other nodes remain idle? Or NiFi tries to distribute the data to other nodes using some routing strategy? I am trying to increase the throughput and thinking something like this as data producer strategy:



[cid:image004.jpg@01D1F199.EB0A8300]



And consumer will be simply as following:

[cid:image003.png@01D1F199.E94A3560]





o   When I increase the number of concurrent tasks, does it make multiple copies of buffer/channel reader etc.? Or is it only the processing which gets multiplied?

*         Get / Fetch File:

o   Can we assume that when this processor is running on multiple nodes and threads, the same file will never get pulled multiple times as a flow-file?

*         Distribute Load Processor:

o   When this processor is running on multiple nodes, will all the incoming flow files go to each instance of running node? And this question is for any processor that run on a cluster and has to consume an incoming flow-file? What's the general routing strategy in NiFi when a processor is running on multiple node?

*         ExecuteSQL

o   Will all the running instances on all the nodes be hitting the RDBMS to fetch the data for the same query leading to duplicates, and heavy load on database?

Thanks,
Manish

RE: Processors in cluster mode

Posted by Manish Gupta 8 <mg...@sapient.com>.

Thanks Bryan. This is very helpful.

Regards,
Manish

From: Bryan Bende [mailto:bbende@gmail.com]
Sent: Tuesday, August 09, 2016 12:50 AM
To: users@nifi.apache.org
Subject: Re: Processors in cluster mode

Hi Manish,

This post [1] has an overview of how to distribute data across your NiFi cluster.

In general though, NiFi runs the same flow on each node and the data needs to be divided across the nodes appropriately depending on the situation.
The only exception to running the same flow on every node is when a processor is scheduled to run Primary Node only.

Concurrent Tasks is the number of threads that will concurrently call a given instance of a processor. So if you have processor "Foo" and a three node cluster, and set concurrent tasks to 2, there will be three instances of Foo and each will have two threads calling the onTrigger method.

For some of your specific cases...

ListenTCP - You would have an instance of this process on all three nodes and need the producer to send to all of them, or have a load balancer that supports TCP sitting in front of the nodes and have the producer send to the load balancer.
Get/Fetch File - These pick up files from the local filesystem so it would be up to the data producer to send/write files on each node of the cluster for each instance of this processor to pick up.
Distribute Load Processor - There will be a Distribute Load processor running on each node and operating on only the flow files on that node.
ExecuteSQL - Typically you would run this on primary node only, or in an upcoming release there is going to be some more options with a ListDatabaseTable processor that can produce instructions than can be distributed across a cluster to your ExecuteSQL processor.

Hope that helps.

Thanks,

Bryan

[1] https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

On Mon, Aug 8, 2016 at 7:55 AM, Manish Gupta 8 <mg...@sapient.com>> wrote:
Hi,

I am running a multi-node NiFi (0.7.0) cluster and trying to implement a streaming ingestion pipeline (@ 200 MB/s at peak and around 30 MB/s at non-peak hours) and routing to different destinations (Kafka, Azure Storage, HDFS). The dataflow will be exposing a TCP port for incoming data and will also be ingesting files from folder, database records etc.

It would be great if someone can provide a link/doc that explains how processors can be expected to behave in a multi-node environment.
My doubts are about how some of the processors work in a clustered mode, and the meaning of concurrent tasks.

For example:

•         ListenTCP:

o   When this processor is scheduled to run on a cluster (and not on the primary node), then does it mean I need to send data to all the individual nodes manually i.e. specify each node’s host names separately? If I don’t specify hosts individually and only provide let’s say primary node’s host name from producer, will all the other nodes remain idle? Or NiFi tries to distribute the data to other nodes using some routing strategy? I am trying to increase the throughput and thinking something like this as data producer strategy:

[cid:image001.jpg@01D1F22A.70EEEE30]

And consumer will be simply as following:

[cid:image002.png@01D1F22A.70EEEE30]

o   When I increase the number of concurrent tasks, does it make multiple copies of buffer/channel reader etc.? Or is it only the processing which gets multiplied?

•         Get / Fetch File:

o   Can we assume that when this processor is running on multiple nodes and threads, the same file will never get pulled multiple times as a flow-file?

•         Distribute Load Processor:

o   When this processor is running on multiple nodes, will all the incoming flow files go to each instance of running node? And this question is for any processor that run on a cluster and has to consume an incoming flow-file? What’s the general routing strategy in NiFi when a processor is running on multiple node?

•         ExecuteSQL

o   Will all the running instances on all the nodes be hitting the RDBMS to fetch the data for the same query leading to duplicates, and heavy load on database?

Thanks,
Manish

Re: Processors in cluster mode

Posted by Bryan Bende <bb...@gmail.com>.

Hi Manish,

This post [1] has an overview of how to distribute data across your NiFi
cluster.

In general though, NiFi runs the same flow on each node and the data needs
to be divided across the nodes appropriately depending on the situation.
The only exception to running the same flow on every node is when a
processor is scheduled to run Primary Node only.

Concurrent Tasks is the number of threads that will concurrently call a
given instance of a processor. So if you have processor "Foo" and a three
node cluster, and set concurrent tasks to 2, there will be three instances
of Foo and each will have two threads calling the onTrigger method.

For some of your specific cases...

ListenTCP - You would have an instance of this process on all three nodes
and need the producer to send to all of them, or have a load balancer that
supports TCP sitting in front of the nodes and have the producer send to
the load balancer.
Get/Fetch File - These pick up files from the local filesystem so it would
be up to the data producer to send/write files on each node of the cluster
for each instance of this processor to pick up.
Distribute Load Processor - There will be a Distribute Load processor
running on each node and operating on only the flow files on that node.
ExecuteSQL - Typically you would run this on primary node only, or in an
upcoming release there is going to be some more options with a
ListDatabaseTable processor that can produce instructions than can be
distributed across a cluster to your ExecuteSQL processor.

Hope that helps.

Thanks,

Bryan

[1]
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

On Mon, Aug 8, 2016 at 7:55 AM, Manish Gupta 8 <mg...@sapient.com> wrote:

> Hi,
>
>
>
> I am running a multi-node NiFi (0.7.0) cluster and trying to implement a
> streaming ingestion pipeline (@ 200 MB/s at peak and around 30 MB/s at
> non-peak hours) and routing to different destinations (Kafka, Azure
> Storage, HDFS). The dataflow will be exposing a TCP port for incoming data
> and will also be ingesting files from folder, database records etc.
>
>
>
> *It would be great if someone can provide a link/doc that explains how
> processors can be expected to behave in a multi-node environment.*
>
> *My doubts are about how some of the processors work in a clustered mode,
> and the meaning of concurrent tasks. *
>
>
>
> For example:
>
>
>
> ·         *ListenTCP*:
>
> o   When this processor is scheduled to run on a cluster (and not on the
> primary node), then does it mean I need to send data to all the individual
> nodes manually i.e. specify each node’s host names separately? If I don’t
> specify hosts individually and only provide let’s say primary node’s host
> name from producer, will all the other nodes remain idle? Or NiFi tries to
> distribute the data to other nodes using some routing strategy? I am trying
> to increase the throughput and thinking something like this as data
> producer strategy:
>
>
>
>
>
> And consumer will be simply as following:
>
>
>
>
>
> o   When I increase the number of concurrent tasks, does it make multiple
> copies of buffer/channel reader etc.? Or is it only the processing which
> gets multiplied?
>
> ·         *Get / Fetch File*:
>
> o   Can we assume that when this processor is running on multiple nodes
> and threads, the same file will never get pulled multiple times as a
> flow-file?
>
> ·         *Distribute Load Processor*:
>
> o   When this processor is running on multiple nodes, will all the
> incoming flow files go to each instance of running node? And this question
> is for any processor that run on a cluster and has to consume an incoming
> flow-file? What’s the general routing strategy in NiFi when a processor is
> running on multiple node?
>
> ·         *ExecuteSQL*
>
> o   Will all the running instances on all the nodes be hitting the RDBMS
> to fetch the data for the same query leading to duplicates, and heavy load
> on database?
>
>
>
> Thanks,
>
> Manish
>
>
>