Posted to users@nifi.apache.org by Isha Lamboo <is...@virtualsciences.nl> on 2023/01/27 10:15:02 UTC

How do you use container-based NiFi ?

Hi all,

I’m looking for some perspectives from people using NiFi deployed in containers (Docker or otherwise).

It seems to me that the NiFi architecture benefits from having a lot of compute resources to share across all flows, especially with large batches arriving periodically. On the other hand, it's hard to prevent badly tuned flows from impacting others, and more and more IT operations are moving to containerized environments, so I'm exploring the options for containerized NiFi as an alternative to our current VM-based approach.

Do you deploy a few large containers similar in capacity to a VM to run all flows together or many small ones with only a few flows on each? And do you deploy them clustered or standalone?

Thanks,

Isha

Re: How do you use container-based NiFi ?

Posted by David Snyder via users <us...@nifi.apache.org>.
What a substantive response! Thank you for this info/your perspective.

VR,

Dave

Sent from my iPhone

RE: How do you use container-based NiFi ?

Posted by Isha Lamboo <is...@virtualsciences.nl>.
Hi Adam,

Thank you very much! This is exactly the kind of context and experience I was hoping for.

The scenarios you describe are what had me stumped. Most NiFi deployments I manage have a mix of fairly static high-volume sensor data flows and rapidly developing data transformation flows. It would make sense to split those up into a scaling set of standalone NiFi or MiNiFi containers plus a fixed cluster for the developers to work their transformation magic, but that drastically increases the complexity compared to the current approach of generously sized VM clusters, even if it could save on resources.

Also, thanks for bringing up MiNiFi. We abandoned it some years ago when it fell too far behind the NiFi versions to be easily managed, but with it being rolled into the NiFi codebase and being a really good fit for containers, I should give it another try.

Kind regards,

Isha

Re: How do you use container-based NiFi ?

Posted by Jeremy Pemberton-Pigott <fu...@gmail.com>.
Hi Isha,

I agree with what Adam has said.

Our setups are small 4-16 node clusters running multiple large flows
(real-time and batch). For our environment we use OKD (OpenShift)
for container orchestration and to provide the routes/load balancing
through a service to the NiFi cluster. We bind the conf, work, state,
logs, and the four repositories to local disks on each server. The YAML
files that launch the containers are parameterized with variables, and a
customized NiFi start script (built into the NiFi image) modifies the
necessary properties files so that we can keep the setup dynamic at
initial cluster creation. Nothing particularly complex, and our clusters
are very compact, working in both cloud and on-premise deployments. The
launch script copies some necessary dynamically changing files to the
target servers: HBase site configs, truststores, and so on.
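
To give a flavor of what that start script does, here is an illustrative
fragment (the variable names here are ours, not anything standard):

    #!/bin/sh
    # Patch per-node values into nifi.properties from environment
    # variables set by the orchestrator, then hand off to the stock
    # NiFi launcher.
    NIFI_HOME=/opt/nifi/nifi-current
    PROPS="$NIFI_HOME/conf/nifi.properties"

    sed -i "s|^nifi.cluster.node.address=.*|nifi.cluster.node.address=$NODE_HOSTNAME|" "$PROPS"
    sed -i "s|^nifi.web.https.host=.*|nifi.web.https.host=$NODE_HOSTNAME|" "$PROPS"
    sed -i "s|^nifi.zookeeper.connect.string=.*|nifi.zookeeper.connect.string=$ZK_CONNECT|" "$PROPS"

    exec "$NIFI_HOME/bin/nifi.sh" run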

The hitch with the flow is that, in our setup, it has to be copied out
from the containers/off the running servers back to the launching server
to be reconciled (typically by file date, since some containers can be
lost and stop syncing). This is done when you tear down the NiFi cluster
or after recovering lost nodes. Incremental changes are kept in the NiFi
Registry, and images are updated to keep things synchronized. Our image
is rebuilt with the latest updated flow only, which provides the initial
flow at launch, while the images are pulled automatically by the running
server on launch of the container (handled by OKD).
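
Mechanically the copy-out is simple; per node it's something like this
(pod name illustrative, and kubectl cp behaves the same way):

    oc cp nifi-0:/opt/nifi/nifi-current/conf/flow.json.gz ./flows/nifi-0-flow.json.gz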

We don't dynamically scale the cluster after it is started; the number of
nodes is chosen before launching. We run the latest 1.19.1 in this
environment and have dozens of standalone NiFi instances collecting from
hundreds of PCs and servers and pushing large volumes of data, which is
large for our use case (100Ks of logs, 10s of GBs/day). The data contains
sensor readings and logs that are transformed and published to Kafka and
HBase for analysis by Spark streaming and Spark batch jobs. This is our
real-time analytics flow; we have another NiFi cluster running in the
same grouping of servers for large batch processing (pulling data from S3
or other disk storage systems) to avoid a big backlog on the real-time
side of things. Both push data to the same backend services, also running
in the same cluster.

We've been scaling the system since v1.1.0 of NiFi, and our setup is a
fully automated deployment with the gotchas previously mentioned.

Re: How do you use container-based NiFi ?

Posted by Adam Taft <ad...@adamtaft.com>.
Isha,

Just some perspective from the field. I have had success with containerized
NiFi and generally get along with it. That being said, I think there are a
few caveats and issues you might find going down this road.

Standalone NiFi in a container works pretty much the way you would want and
expect. You do need to be careful about where you are mounting your NiFi
configuration directories, though, e.g. content_repository,
database_repository, flowfile_repository, provenance_repository, state,
logs and work. All of these directories are actively written by NiFi, and
it's good to have these exported as bind mounts external to the container.

You will definitely want to bind mount the flow.xml.gz and flow.json.gz
files as well, or you will lose your live dataflow configuration changes as
you use NiFi. Any change to your NiFi canvas gets written into flow.xml.gz,
which means you need to keep a copy of it outside of your container. And
there are potentially other files in the conf folder that you also want to
keep around. NiFi unfortunately doesn't organize all of these directories
into a single location by default, so you kind of have to reconfigure
and/or bind mount a lot of different paths.
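
To make that concrete, here's a rough docker run sketch. The container
paths assume the official apache/nifi image and its /opt/nifi/nifi-current
home; the host paths are just examples, and you'll want to seed the host
conf directory from the image first so NiFi still finds its defaults:

    # Bind-mount everything NiFi actively writes; conf also holds
    # nifi.properties plus flow.xml.gz / flow.json.gz
    NIFI_HOME=/opt/nifi/nifi-current
    docker run -d --name nifi \
      -p 8443:8443 \
      -v /data/nifi/conf:$NIFI_HOME/conf \
      -v /data/nifi/content_repository:$NIFI_HOME/content_repository \
      -v /data/nifi/database_repository:$NIFI_HOME/database_repository \
      -v /data/nifi/flowfile_repository:$NIFI_HOME/flowfile_repository \
      -v /data/nifi/provenance_repository:$NIFI_HOME/provenance_repository \
      -v /data/nifi/state:$NIFI_HOME/state \
      -v /data/nifi/logs:$NIFI_HOME/logs \
      apache/nifi:latest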

I have found NiFi clustering in a dockerized environment to be less
desirable. Primarily the problem is that the definition of cluster nodes is
mostly hard coded into the nifi.properties file. Usually in a containerized
environment, you want the ability to dynamically bring nodes up/down as
needed (with dynamic IP/network configuration), especially in container
orchestration frameworks like Kubernetes. There have been a lot of
experiments and possibly even some reasonable solutions coming out to help
with containerized clusters, but generally you're going to find you have to
crack your knuckles a little bit to get this to work. If you're content
with a mostly statically defined, non-elastic cluster configuration, then a
clustered NiFi on Docker is possible.
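
For reference, these are the kinds of per-node values in nifi.properties
that get in the way of elastic membership (the hostnames below are
placeholders, not anything standard):

    # Each cluster member needs its own copy of these before startup
    nifi.cluster.is.node=true
    nifi.cluster.node.address=nifi-node-1.example.com
    nifi.cluster.node.protocol.port=11443
    nifi.web.https.host=nifi-node-1.example.com
    nifi.zookeeper.connect.string=zk1:2181,zk2:2181,zk3:2181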

As an option, if you stick with standalone deployments, what you can do
instead is front your individual NiFi node instances with a load balancer.
This may be a poor man's approach to load distribution, but it works
reasonably well and I've seen it in action on large-volume flows. If your
data source can deliver to a load balancer, then you can have the load
balancer round-robin (or similar) to your underlying standalone nodes. In a
container orchestration environment, you can imagine Kubernetes spinning
containerized nodes up and down to handle demand, and managing a load
balancer configuration as those nodes come up. It's all possible, but will
require some work.
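
As a sketch of the idea, assuming each standalone node runs a ListenHTTP
processor on port 8081 with its default contentListener base path (nginx
shown here, but any load balancer works the same way):

    # Round-robin (the nginx default) across standalone NiFi nodes
    upstream nifi_ingest {
        server nifi-a:8081;
        server nifi-b:8081;
        server nifi-c:8081;
    }
    server {
        listen 80;
        location /contentListener {
            proxy_pass http://nifi_ingest;
        }
    }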

Of course, doing anything with multiple standalone nodes means that you
have to propagate changes from one NiFi canvas to all your nodes manually.
This is a huge pain and not really scalable. So the load balancer approach
is only good if your dataflow configurations are very static and don't
change day-to-day with operations.

That points at one of the core issues with containerized NiFi: what to do
with the flow configuration itself. On the one hand, you kind of want to
"burn in" your flow configuration into your Docker image, e.g. the
flow.xml.gz and/or flow.json.gz would be included as part of the image
itself. This enables your NiFi system to come up with a fully configured
set of processors ready to accept connections.
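
The burn-in itself can be as small as a two-line Dockerfile. This assumes
the official image's conf location and that you've exported the flow files
from an already-configured instance:

    FROM apache/nifi:1.19.1
    # Bake the exported flow into the image; new containers start
    # with the processors already configured
    COPY --chown=nifi:nifi flow.json.gz flow.xml.gz /opt/nifi/nifi-current/conf/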

But part of the fun with NiFi is being able to make dataflow and processor
configuration changes on the fly as needed, based on operational conditions.
For example, maybe you need to temporarily stop data moving to one location
and have it transported to another. This "live" and dynamic way to manage
NiFi is a powerful feature, but it kind of goes against the grain of a
containerized or static deployment approach; e.g. new nodes coming online
will not necessarily have the latest configuration changes that your
operational staff added recently. The NiFi Registry can somewhat help
here.

Finally, to give a shout-out: you may want to consider a dockerized MiNiFi
fleet instead of traditional NiFi. MiNiFi is maybe slightly more aligned
with a containerized clustering approach, as it more directly supports this
concept of a "burned in" processor configuration. In this way, MiNiFi nodes
can be spun up or down based on demand without too much fuss. E.g., MiNiFi
isn't really cluster aware and each node acts independently, making it a
bit easier as a solution for containerized or dynamic deployments.
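
A sketch of that burn-in, assuming the apache/nifi-minifi convenience image
(paths may differ by image version) and a config.yml produced from your
flow with the MiNiFi toolkit:

    FROM apache/nifi-minifi:latest
    # MiNiFi reads its entire dataflow from conf/config.yml at startup
    COPY config.yml /opt/minifi/minifi-current/conf/config.yml

Scaling is then just running more replicas of the same image, e.g.
docker compose up --scale minifi=5.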

Hope this gives you some thoughts. There are definitely a lot of recipes
and approaches to containerized NiFi, so do some searching to find one that
matches what you're after. Almost any configuration can be done, based on
your needs.

/Adam


