Posted to users@nifi.apache.org by David Early via users <us...@nifi.apache.org> on 2022/11/01 15:42:05 UTC

Content repository threshold issues

Hi all,

We have a 3-node cluster processing a fairly average amount of data, but we
are running into an issue with the content repository:

[image: image.png]
This is specifically because the content repo is over its assigned
percentage.

The problem we are having is we aren't quite sure what to do about it.

Part of the issue is that the problem appears to be in a flow that receives
one large item which is subsequently split into anywhere from 100 to 8,000
individual flow files.

We suspect that part of the issue is that if ONE of the flow files derived
from an original object (that had, say, 500 items in it) is retained in a
queue somewhere for any reason, the ENTIRE content of the original object is
retained in the content repo.  In other words, even if I only retain one
flow file in a queue, all 500 are retained in the content repo.
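This pinning behavior can be sketched as reference counting over shared
content claims. The following is a minimal illustrative model, not NiFi's
actual code; the class and method names are hypothetical:

```python
# Simplified model of NiFi-style content claims: many flow files' content
# can live in one on-disk claim file, and the claim can only be removed
# (or archived) once NO flow file references any part of it.
# Illustrative sketch only -- not NiFi's real implementation.

class ContentClaim:
    def __init__(self, claim_id, size_bytes):
        self.claim_id = claim_id
        self.size_bytes = size_bytes
        self.refs = 0  # flow files currently pointing into this claim

class FlowFile:
    def __init__(self, claim, offset, length):
        self.claim = claim
        self.offset = offset    # where this flow file's content starts
        self.length = length    # how many bytes belong to it
        claim.refs += 1

    def drop(self):
        """Called when the flow file leaves the flow."""
        self.claim.refs -= 1

# One large object split into 500 flow files over the SAME claim:
claim = ContentClaim("claim-1", size_bytes=50_000_000)
flowfiles = [FlowFile(claim, offset=i * 100_000, length=100_000)
             for i in range(500)]

# 499 flow files complete; one lingers in a queue somewhere:
for ff in flowfiles[1:]:
    ff.drop()

# The full 50 MB stays on disk until the last reference is gone.
reclaimable = claim.size_bytes if claim.refs == 0 else 0
print(claim.refs, reclaimable)   # one reference left, nothing reclaimable

flowfiles[0].drop()
reclaimable = claim.size_bytes if claim.refs == 0 else 0
print(claim.refs, reclaimable)   # no references, full claim reclaimable
```

Under this model, a single lingering flow file pins the entire claim, which
matches the "one retained flow file keeps all 500" behavior described above.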

We just deleted about 100M of older items that had accumulated in a user's
area, and it removed 26G from the content repo (77G down to 51G).

That said, we feel a bit like if we increase the percentage for the content
repo, it just uses more regardless.  We tried moving the percentage from 50
to 70, and usage just seems to settle at just under 70% instead of just
under 50%.

And if it backs up, processing appears to STOP while it waits for cleanup,
which is not good.

Are we missing something?  Do we just need a bigger disk?

nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=1 MB
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.max.retention.period=4 days
nifi.content.repository.archive.max.usage.percentage=70%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/
nifi.remote.contents.cache.expiration=30 secs
nifi.web.max.content.size=
Dave

Re: Content repository threshold issues

Posted by Joe Witt <jo...@gmail.com>.
Dave

What version of NiFi are you using?

Change nifi.content.claim.max.appendable.size to 50KB then restart.
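That is, in conf/nifi.properties (the smaller value means each claim file
holds fewer flow files' content, so a single lingering flow file pins far
less data):

```properties
# Was: nifi.content.claim.max.appendable.size=1 MB
nifi.content.claim.max.appendable.size=50 KB
```

A restart is required for the change to take effect.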

How large is your './content_repository' disk?  Also that path suggests
you're likely just sharing the same partition for content, flowfile, and
provenance data.  This is not recommended as it makes it very difficult for
nifi to manage space for each repository properly since the total size
available/used will be a combination of all three at least.
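Where disks are available, a common layout is to give each repository its
own partition, for example (the mount points here are illustrative
assumptions, not a recommendation for any specific host):

```properties
# Each repository on its own partition so NiFi can manage space per repo
nifi.flowfile.repository.directory=/data1/flowfile_repository
nifi.content.repository.directory.default=/data2/content_repository
nifi.provenance.repository.directory.default=/data3/provenance_repository
```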

As far as the split logic - what processor does the split?  In some cases
things like SplitText for instance can do a split by merely marking offsets
of the original content instead of rewriting.  This is dramatically more
efficient but you're right that it could hold onto the original dataset
longer than anticipated if any of the content is still actively reachable.
Other split functions will rewrite content, and thus their behavior will be
closer to what you'd expect, at a notable performance cost.
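The tradeoff can be sketched as follows, with a toy line-splitter in both
styles (hypothetical function names; this only illustrates the idea, not
any processor's real code):

```python
# Two ways to "split" a large payload, illustrating the tradeoff above.
#
# Offset-based split (as SplitText can do): each child merely records
# (claim, offset, length) into the parent's claim -- fast, no copying,
# but the parent's bytes stay on disk while ANY child is alive.
#
# Copying split: each child gets its own content -- slower writes, but
# the parent's claim is releasable as soon as the parent completes.

payload = b"line1\nline2\nline3\n"

def split_by_offset(data):
    """Children are just (claim, offset, length) views into one claim."""
    children, start = [], 0
    for i, b in enumerate(data):
        if b == ord("\n"):
            children.append(("parent-claim", start, i + 1 - start))
            start = i + 1
    return children

def split_by_copy(data):
    """Children each carry their own copy of the bytes."""
    return [line + b"\n" for line in data.split(b"\n") if line]

print(split_by_offset(payload))  # views into the single shared claim
print(split_by_copy(payload))    # independent per-child content
```

With offset-based splitting, dropping 7,999 of 8,000 children frees nothing
on disk; with copying, each completed child's content is independently
reclaimable.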

Thanks
