Posted to users@nifi.apache.org by Ryan Hendrickson <ry...@gmail.com> on 2020/09/17 14:58:37 UTC

Content Claims Filling Disk - Best practice for small files?

Hello,
I've got ~15 million FlowFiles, each roughly 4KB, totaling about 55GB of
data on my canvas.

However, the content repository (on its own partition) is completely full
with 350GB of data.  I'm pretty certain the way Content Claims store the
data is responsible for this.  In previous experience with larger files,
we haven't seen this as much.

My guess is that as data streams through and is added to a claim, the
claim isn't always released as the small files leave the canvas.

We've run into this issue enough times that I figure there's probably a
"best practice for small files" for the content claim settings.

These are our current settings:
nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=1 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.default=/var/nifi/repositories/content
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#content-repository


There's 1024 folders on the disk (0-1023) for the Content Claims.
Each file inside the folders is roughly 2MB to 8MB (which is odd because
I thought the max appendable size would keep them no larger than 1MB).
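
For reference, here's roughly how I'm measuring this on disk (a quick
sketch, assuming the default repo layout from the settings above):
ls /var/nifi/repositories/content | wc -l                      # -> 1024 section folders
find /var/nifi/repositories/content -type f -size +1M | wc -l  # claim files over the 1MB cap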

Is there a way to expand the number of folders and/or reduce the number of
individual FlowFiles that are stored in each claim?

I'm hoping there might be a best practice out there though.

Thanks,
Ryan

Re: Content Claims Filling Disk - Best practice for small files?

Posted by Mark Payne <ma...@hotmail.com>.
Ryan,

Thanks. So 1.12.0 has no known issues with the content repo not being cleaned up properly.

As you pointed out, nifi.content.claim.max.appendable.size is intended to cap the amount of data that will be written to a single file in the content repository. However, it does come with a couple of caveats.

(1) Once this cap is reached, no more FlowFiles will be appended to that file; but once a write has begun, it won’t spill over into another file. So, with the cap set to 1 MB, you may write 100 FlowFiles of 4 KB each (400 KB, still under the cap) and then a 4 MB FlowFile on top of that. The file ends up around 4.4 MB, and it won’t be cleaned up until all 101 FlowFiles have left your system.

(2) The cap only takes effect between Process Sessions, meaning that if a Processor handles many FlowFiles in a single session, they can all be written to a single file. Generally, this happens when Run Duration is set to a high value. For example, if Run Duration is set to 1 second and there are enough FlowFiles queued to keep the processor busy for a full second, all of those FlowFiles could be written to the same file on disk.

Also of note, the files are only cleaned up when the FlowFile Repository checkpoints. This is determined by the “nifi.flowfile.repository.checkpoint.interval” property, which defaults to 20 seconds in 1.12.0; if you have a larger value there, you may want to decrease it.
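
For example, in nifi.properties (20 secs is the 1.12.0 default; shown here just to illustrate the value format):
nifi.flowfile.repository.checkpoint.interval=20 secs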

One thing that might be of interest in understanding why the content claims still exist in the repo is to run “bin/nifi.sh diagnostics --verbose diagnostics1.txt”.
That will write out a file, diagnostics1.txt, that has lots of diagnostics information. This includes which FlowFiles are referencing each file in the content repository. I.e., which FlowFiles must finish processing before the file can be cleaned up.
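
For example, something along these lines will produce the dump and count the claims that are no longer referenced (a sketch; adjust the grep string to the dump’s exact field formatting):
bin/nifi.sh diagnostics --verbose diagnostics1.txt    # the file is written relative to NIFI_HOME
grep -c "Claimant Count =0, In Use = false" diagnostics1.txt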

Hope this helps!
-Mark


On Sep 17, 2020, at 11:07 AM, Ryan Hendrickson <ry...@gmail.com> wrote:

1.12.0

Thanks,
Ryan

On Thu, Sep 17, 2020 at 11:04 AM Joe Witt <jo...@gmail.com> wrote:
Ryan

What version are you using? I do think we had an issue that kept items around longer than intended that has been addressed.

Thanks




Re: Content Claims Filling Disk - Best practice for small files?

Posted by Ryan Hendrickson <ry...@gmail.com>.
Mark,
To close the loop on the max-FlowFiles property
(nifi.content.claim.max.flow.files), I confirmed it was removed in 1.12.0;
we just hadn't removed it from our nifi.properties file.  I've done that now.

For the server with the full disk (/var/nifi/repositories/content and
/var/nifi/repositories/flowfile are both full), there are lots of content
claims that persist across 5-minute intervals - in fact, tons.  The
diagnostics file is 14MB.  Entries look like:
default/483/1600318340509-49635, Claimant Count =0, In Use = false,
Awaiting Destruction = false, References (0) = []

Doing a
cat diag | grep " Claimant Count =0, In Use = false, Awaiting Destruction = false, References (0)" | wc -l
yields 39,219 matching lines.

Could too high a CPU load cause the claims not to be cleaned up?  Is there
a way to kick off a manual clean-up?


I've also checked another server with the same setup; that one doesn't
have this issue.

Comparing a few stats from dumps taken a few minutes apart:
Diag1:
Queued FlowFiles: 57
Queued Bytes: 220412760
Running components: 10

Diag2:
Queued FlowFiles: 4070
Queued Bytes: 285217783
Running components: 11

The Claimant Counts are cleaning up well here.

Thanks,
Ryan


Re: Content Claims Filling Disk - Best practice for small files?

Posted by Mark Payne <ma...@hotmail.com>.
Ryan,

OK, thanks. So the “100 based on the max size” is… “fun.” Not entirely sure when that property made it into nifi.properties - I’m guessing that when max.appendable.claim.size was added, we intended to also implement a max number of FlowFiles, but it was never implemented. So the property has probably never been used, and it was just a bug that it ever made it into nifi.properties. I think that was actually cleaned up in 1.12.0.

What will be interesting is if you wait, say, 5 minutes, and then run the diagnostics dump again. Are there any files that previously had a Claimant Count of 0 and In Use = false that still exist with a Claimant Count of 0? If not, then at least we know that cleanup is working properly. If there are, that would indicate that the content repo is not cleaning up properly.
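
Something like this would show it (a sketch; it assumes the claim path is the first comma-separated field, as in the entries you posted):
grep "Claimant Count =0, In Use = false" diagnostics1.txt | cut -d',' -f1 | sort > claims1.txt
grep "Claimant Count =0, In Use = false" diagnostics2.txt | cut -d',' -f1 | sort > claims2.txt
comm -12 claims1.txt claims2.txt | wc -l   # claims unreferenced in both dumps -> not being cleaned up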






Re: Content Claims Filling Disk - Best practice for small files?

Posted by Ryan Hendrickson <ry...@gmail.com>.
A couple things from it:

1. The sum of the "Claimant counts" equals the number of FlowFiles reported
on the Canvas.
2. None are Awaiting Destruction
3. Claimant Count Lowest number is 1 (when it's not zero)
4. Claimant Count Highest number is 4,773  (Should this be 100 based on the
max setting? Or maybe not, if more than 100 are read in a single session?)
5. The sum of the "References" is 64,814.
6. The lowest Reference is 1 (when it's not zero)
7. The highest Reference is 4,773
8. Some References have Swap Files (10,006) and others have FlowFiles (470)
9. There are 10,468 "In Use"
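
(For reference, I tallied these with quick one-liners along these lines;
a sketch, adjust to the exact field strings in the dump:)
grep -o "Claimant Count =[0-9]*" diag | awk -F= '{s+=$2} END {print s}'           # sum of claimant counts
grep -o "Claimant Count =[0-9]*" diag | awk -F= '{print $2}' | sort -n | tail -1  # highest claimant count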

Anything there stick out to anyone?

Thanks,
Ryan


Re: Content Claims Filling Disk - Best practice for small files?

Posted by Ryan Hendrickson <ry...@gmail.com>.
Correction - it did work.  I was expecting it to be in the folder I ran
nifi.sh from, rather than in NIFI_HOME.
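
(i.e., the dump lands at something like:)
ls "$NIFI_HOME"/diagnostics1.txt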

Reviewing it now...

Ryan


Re: Content Claims Filling Disk - Best practice for small files?

Posted by Ryan Hendrickson <ry...@gmail.com>.
Hey Mark,
I should have mentioned that PutElasticsearchHttp is going to 2 different
clusters.  We did play with different thread counts for each of them.  At
one point we were wondering if too large a Batch Size would make the
threads block each other.

It looks like PutElasticsearchHttp serializes every FlowFile to verify it's
a well-formed JSON document [1].  That alone feels pretty CPU-expensive;
in our case, we already know we have valid JSON.  Just as an anecdotal
benchmark: a combination of [MergeContent + 2x InvokeHTTP] uses a total of
9 threads to accomplish the same thing that [2x DistributeLoad + 2x
PutElasticsearchHTTP] does with 50 threads.  The DistributeLoads need 5
threads each to keep up; each PutElasticsearchHTTP needs about 10.

PutElasticsearchHTTP is configured like this:
Index: ${esIndex}
Batch Size: 3000
Index Operation: Index

For the ./nifi.sh diagnostics --verbose diagnostics1.txt command, I had to
export TOOLS_JAR on the command line, pointing at where tools.jar was located.
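
(e.g., something like this; the exact tools.jar path depends on your JDK
install:)
export TOOLS_JAR=$JAVA_HOME/lib/tools.jar   # example path only
./nifi.sh diagnostics --verbose diagnostics1.txt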

I'm not getting a file written out, though.  I still have the "full" NiFi
up and running - I assume that's okay?  Do I need to change my logback.xml
levels at all?


[1]
https://github.com/apache/nifi/blob/aa741cc5967f62c3c38c2a47e712b7faa6fe19ff/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/PutElasticsearchHttp.java#L299

Thanks,
Ryan


Re: Content Claims Filling Disk - Best practice for small files?

Posted by Mark Payne <ma...@hotmail.com>.
Ryan,

Why are you using DistributeLoad to go to two different PutElasticsearchHttp processors? Does that perform better for you than a single PutElasticsearchHttp processor with multiple concurrent tasks? It shouldn’t, really. I’ve never used that processor, but if two instances of the processor perform significantly better than 1 instance with 2 concurrent tasks, that’s probably worth looking into.

-Mark




RE: Content Claims Filling Disk - Best practice for small files?

Posted by "Williams, Jim" <jw...@alertlogic.com>.
Ryan,

Is this maybe a case of exhausting inodes on the filesystem rather than exhausting the space available?  If you do a ‘df -i’ on the system, what do you see for inode usage?
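
If shell access is awkward, a quick Python check of the same thing -- the path is just the content-repo directory from the nifi.properties posted earlier:

    import os

    # nifi.content.repository.directory.default from nifi.properties
    stats = os.statvfs("/var/nifi/repositories/content")

    total = stats.f_files  # total inodes on the filesystem
    free = stats.f_ffree   # free inodes
    print(f"inodes used: {100.0 * (total - free) / total:.1f}% "
          f"({total - free} of {total})")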

Warm regards,

Jim Williams | Manager, Site Reliability Engineering
O: +1 713.341.7812 | C: +1 919.523.8767 | jwilliams@alertlogic.com | alertlogic.com


Re: Content Claims Filling Disk - Best practice for small files?

Posted by Joe Witt <jo...@gmail.com>.
Can you share your flow.xml.gz?

Re: Content Claims Filling Disk - Best practice for small files?

Posted by Ryan Hendrickson <ry...@gmail.com>.
1.12.0

Thanks,
Ryan


Re: Content Claims Filling Disk - Best practice for small files?

Posted by Joe Witt <jo...@gmail.com>.
Ryan,

What version are you using? I do think we had an issue that kept items
around longer than intended, which has since been addressed.

Thanks
