Posted to users@nifi.apache.org by Vijay Chhipa <vc...@apple.com> on 2021/12/20 17:55:36 UTC

Flowfile disk space is not released from the content-repository until the entire dataflow is completed

Hi all, 

We have a use case where we list out the contents of a website and then download each item in the list and process it. 
What I expected is that once each item (a file) has been downloaded and processed, and its flowfile is no longer in any of the queues, its disk storage would be released. Instead, the content-repo size keeps increasing as the files are processed. If I pause the flow for several hours (over 24 hours), the repo size stays at the increased level and does not go down. Only when I clear all the queues does the content-repo size go back down to its original size (before the flow started).
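
One way to make the observation above reproducible is to record the on-disk size of the content repository at intervals while the flow runs. The following is a minimal sketch, not part of NiFi; the helper name directory_size_bytes is made up for illustration, and the path is the content repository directory from the properties in this message.

```python
import os


def directory_size_bytes(root):
    """Return the total size in bytes of regular files under root.

    Hypothetical helper for illustration; it is not a NiFi API.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # A file may be deleted or moved between listing and stat.
                continue
    return total


if __name__ == "__main__":
    # Path taken from nifi.content.repository.directory.default in this thread.
    repo = "/var/foo/bar/content_repository"
    print(f"{directory_size_bytes(repo) / 1e9:.2f} GB")
```

Running this before starting the flow, during processing, and after clearing the queues gives three numbers that can be compared directly.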

I am not using provenance and have disabled it. 
Here is the relevant section of the properties file. 

I would have been okay with this, but I need to process over 200K files, each almost 1 GB in size.
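
For scale, here is a rough estimate of the total volume involved. It assumes exactly 200,000 files of exactly 1 GB each (the message says "over 200K" and "almost 1GB", so this is only an approximation) and uses decimal units.

```python
# Assumed round numbers; the real counts and sizes are only approximate.
FILE_COUNT = 200_000
FILE_SIZE_GB = 1.0

total_gb = FILE_COUNT * FILE_SIZE_GB
total_tb = total_gb / 1000  # decimal terabytes

print(f"approx. {total_gb:,.0f} GB = {total_tb:,.0f} TB")
```

At roughly 200 TB in total, the content repository cannot simply be sized to hold everything until the dataflow finishes.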

What is holding references to these processed flowfiles, and how can I design the dataflow so that the content repo does not fill up?

nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
nifi.flowfile.repository.directory=/var/foo/bar/flowfile_repository
nifi.flowfile.repository.partitions=256
nifi.flowfile.repository.checkpoint.interval=2 mins
nifi.flowfile.repository.always.sync=false

# Content Repository
nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=1 MB
nifi.content.claim.max.flow.files=10
nifi.content.repository.directory.default=/var/foo/bar/content_repository
nifi.content.repository.archive.max.retention.period=6 hours
nifi.content.repository.archive.max.usage.percentage=40%
nifi.content.repository.archive.enabled=false
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/

# Provenance Repository Properties
nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
nifi.provenance.repository.debug.frequency=1_000_000
nifi.provenance.repository.encryption.key.provider.implementation=
nifi.provenance.repository.encryption.key.provider.location=
nifi.provenance.repository.encryption.key.id=
nifi.provenance.repository.encryption.key=

# Persistent Provenance Repository Properties
nifi.provenance.repository.directory.default=/var/foo/bar/provenance_repository
nifi.provenance.repository.max.storage.time=24 hours
nifi.provenance.repository.max.storage.size=1 GB
nifi.provenance.repository.rollover.time=30 secs
nifi.provenance.repository.rollover.size=100 MB
nifi.provenance.repository.query.threads=2
nifi.provenance.repository.index.threads=2
nifi.provenance.repository.compress.on.rollover=true
nifi.provenance.repository.always.sync=false


nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship

nifi.provenance.repository.indexed.attributes=

nifi.provenance.repository.index.shard.size=500 MB
nifi.provenance.repository.max.attribute.length=65536
nifi.provenance.repository.concurrent.merge.threads=2

nifi.provenance.repository.warm.cache.frequency=1 hour
nifi.provenance.repository.buffer.size=100000
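
To double-check which content-repository archive settings are actually in effect, the properties can be read programmatically. This is a minimal sketch with a deliberately simplified parser (it handles only `key=value` lines and `#` comments, not the full Java properties syntax); parse_properties is a hypothetical helper, and the sample text is copied from the listing above.

```python
def parse_properties(text):
    """Parse simple key=value lines, ignoring blanks and '#' comments.

    Simplified on purpose; not a full Java .properties parser.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


SAMPLE = """
# Content Repository
nifi.content.claim.max.appendable.size=1 MB
nifi.content.claim.max.flow.files=10
nifi.content.repository.archive.max.retention.period=6 hours
nifi.content.repository.archive.max.usage.percentage=40%
nifi.content.repository.archive.enabled=false
"""

props = parse_properties(SAMPLE)
prefix = "nifi.content.repository.archive."
for key in sorted(props):
    if key.startswith(prefix):
        print(f"{key} = {props[key]}")
```

In the listing above, nifi.content.repository.archive.enabled is false, so the retention period and usage percentage values should be read with that in mind.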

Thanks
Vijay

Re: Flowfile disk space is not released from the content-repository until the entire dataflow is completed

Posted by Joe Witt <jo...@gmail.com>.
Vijay

nifi.content.repository.archive.max.retention.period=6 hours
nifi.content.repository.archive.max.usage.percentage=40%

Did you actually run out of disk space?  What error did you get?

We do remove content from the content repository when there is no
longer an active flow file that points at that version of content AND
when we need to free up space.
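
The two conditions above can be pictured with a toy reference-counting model. This is a simplified sketch for illustration only, not NiFi's actual implementation; the class ContentRepoModel and its method names are invented here. Content becomes reclaimable once no active flow file points at it, and space is only returned when a separate purge step runs.

```python
from collections import Counter


class ContentRepoModel:
    """Toy model of content claims; hypothetical, not NiFi code."""

    def __init__(self):
        self._claimants = Counter()  # claim id -> number of active flow files
        self._sizes = {}             # claim id -> size in bytes

    def add_claim(self, claim_id, size_bytes):
        self._sizes[claim_id] = size_bytes

    def increment(self, claim_id):
        # A flow file (for example, one sitting in a queue) now points here.
        self._claimants[claim_id] += 1

    def decrement(self, claim_id):
        if self._claimants[claim_id] <= 0:
            raise ValueError(f"no active claimants for {claim_id}")
        self._claimants[claim_id] -= 1

    def reclaimable(self):
        # Claims that no active flow file points at any more.
        return sorted(c for c in self._sizes if self._claimants[c] == 0)

    def purge(self):
        # Free space for every reclaimable claim; return bytes freed.
        freed = 0
        for claim_id in self.reclaimable():
            freed += self._sizes.pop(claim_id)
            self._claimants.pop(claim_id, None)
        return freed


repo = ContentRepoModel()
repo.add_claim("claim-1", 1_000_000_000)
repo.increment("claim-1")   # original flow file
repo.increment("claim-1")   # a second flow file pointing at the same content
repo.decrement("claim-1")   # first flow file finishes
still_held = repo.reclaimable()
repo.decrement("claim-1")   # second flow file leaves its queue
freed_bytes = repo.purge()
```

In this model, a single flow file left in any queue is enough to keep its content on disk, which would be consistent with the space being freed only after the queues are cleared.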

What version are you using?

Thanks
