Posted to dev@nifi.apache.org by Purushotham Pushpavanthar <pu...@gmail.com> on 2019/08/04 20:00:58 UTC

Implementation of FilesystemComponentStatusRepository

Hi,

Right now we only have a volatile implementation of ComponentStatusRepository,
which serves the status report of each component. Though we have reporting
tasks like AmbariReportingTask and PrometheusReportingTask which push metrics
to external persistent systems, NiFi's Component Status History serves as
the primary source of truth. We might have to fall back to it if there is
some issue with the external metrics collection system.
Since the only available implementation is
VolatileComponentStatusRepository, this data is lost if a node is restarted.
I feel that a filesystem-backed persistent implementation should be available
for users to choose from. Is there anything planned in this regard?


Regards,
Purushotham Pushpavanth

Re: Implementation of FilesystemComponentStatusRepository

Posted by Mark Payne <ma...@hotmail.com>.

Purushotham,

I'm not aware of anyone working on a file-based implementation, but it would certainly be a welcome feature if anyone does take the initiative to develop it. In addition to keeping stats for much longer, it also means that we could avoid holding all of these stats in memory, as they can become quite large when you have a large graph of processors & connections. Also, a file-based implementation would mean that the stats can span restarts. This can get very interesting if we were to have other events persisted as well, such as node restarts, changes in the number of disks used for repositories, versions of the software, etc., as plotting these on top of the metrics would provide a lot of insight into what caused specific differences in the metrics.

That being said, implementing such a repository well may not be as straightforward as it seems. We typically capture the metrics every minute, or every 5 minutes, or whatever is configured. For seeing how things are going on a short-term basis, especially shortly after a restart, the 1-minute captures are very helpful. But we often use 5 minutes because it uses only 20% as much memory for the same time frame. If we capture these metrics periodically, the most intuitive way to write them out would be in a row-oriented store, using perhaps JSON, Avro, or CSV. This is easy enough. But when they are queried, they are queried by component ID, and reading a huge amount of data just to render one component's history is sub-optimal, so for querying you really want a column- or block-oriented storage format. Perhaps a database like H2 would make the most sense. Or a time-series database of some sort.
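
To make the query-pattern point concrete, here is a minimal sketch using Python's built-in sqlite3 module purely as a stand-in for an embedded database such as H2. The table layout, column names, and component IDs are hypothetical and not taken from NiFi. The only point it illustrates is that a key leading with the component ID lets a per-component history query avoid reading every snapshot.

```python
import sqlite3

# Hypothetical schema: one row per (component, capture time, metric).
# SQLite is used here only as a stand-in for an embedded database like H2.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE component_status (
        component_id TEXT    NOT NULL,
        captured_at  INTEGER NOT NULL,  -- epoch seconds of the snapshot
        metric       TEXT    NOT NULL,
        value        INTEGER NOT NULL,
        PRIMARY KEY (component_id, captured_at, metric)
    )
    """
)


def record_snapshot(captured_at, statuses):
    """Write one snapshot; statuses maps component_id -> {metric: value}."""
    rows = [
        (component_id, captured_at, metric, value)
        for component_id, metrics in statuses.items()
        for metric, value in metrics.items()
    ]
    with conn:  # commits the batch as a single transaction
        conn.executemany("INSERT INTO component_status VALUES (?, ?, ?, ?)", rows)


def history(component_id, metric):
    """Return [(captured_at, value), ...] for one component, oldest first."""
    cursor = conn.execute(
        "SELECT captured_at, value FROM component_status "
        "WHERE component_id = ? AND metric = ? ORDER BY captured_at",
        (component_id, metric),
    )
    return cursor.fetchall()


# Three one-minute snapshots for two made-up components.
for minute in range(3):
    record_snapshot(
        minute * 60,
        {
            "proc-1": {"bytes_in": 100 * minute, "flowfiles_in": minute},
            "conn-1": {"queued_count": 5 + minute},
        },
    )

print(history("proc-1", "bytes_in"))  # [(0, 0), (60, 100), (120, 200)]
```

Writes are still done one snapshot at a time, as with a row-oriented file, but the primary key orders the data by component first, so the lookup for a single component touches only that component's rows.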

It may also be important to "roll up" metrics. We capture about 12 metrics per component. For a fairly large graph, we often see 10,000 components or more (these include Processors, Ports, Process Groups, Connections, Remote Process Groups, etc.). So that would be a total of about 120,000 metrics recorded for each "snapshot." If we assume 8 bytes per metric (for a 64-bit long value) we're now talking about 960 KB to store each snapshot. If that's done on a 1-minute boundary for 1 month we'd be at more than 41 GB just to store the raw values, before any overhead. So we'd have to either capture metrics less frequently (which is less than ideal for determining how things have been processing very recently, over the last hour or so), accept a huge storage footprint, or "roll up" the metrics so that older metrics get averaged, etc.
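
The arithmetic above, along with one possible shape for a roll-up, can be sketched as follows. This is only an illustration under the stated assumptions (12 metrics per component, 10,000 components, 8 bytes per value, a 30-day month); the function names are made up, and averaging is just one roll-up strategy. Counter-style metrics might be better summed than averaged.

```python
METRICS_PER_COMPONENT = 12  # approximate figure from the discussion above
COMPONENTS = 10_000         # a fairly large graph
BYTES_PER_METRIC = 8        # one 64-bit long value


def snapshot_bytes(components=COMPONENTS):
    """Raw size of one snapshot, ignoring any storage overhead."""
    return components * METRICS_PER_COMPONENT * BYTES_PER_METRIC


def retention_bytes(interval_minutes, days, components=COMPONENTS):
    """Raw size of all snapshots kept for `days` at the given capture interval."""
    snapshots = days * 24 * 60 // interval_minutes
    return snapshots * snapshot_bytes(components)


def roll_up(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = {}
    for timestamp, value in samples:
        start = timestamp - timestamp % bucket_seconds
        buckets.setdefault(start, []).append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


print(snapshot_bytes())              # 960000 bytes, i.e. 960 KB per snapshot
print(retention_bytes(1, 30) / 1e9)  # 41.472 (GB) for 30 days of 1-minute captures
print(retention_bytes(5, 30) / 1e9)  # 8.2944 (GB), 20% of the 1-minute figure

# Ten 1-minute samples rolled up into two 5-minute averages.
one_minute = [(minute * 60, minute) for minute in range(10)]
print(roll_up(one_minute, 300))      # [(0, 2.0), (300, 7.0)]
```

Keeping recent data at 1-minute resolution and rolling older data up to coarser buckets is what would keep the total well below the 41 GB figure while still showing the last hour or so in detail.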

That was all a really long-winded way of saying I think it would be awesome to have, but there are a lot of design trade-offs that would have to be considered, and it would be far from a trivial task.

Thanks
-Mark
