Posted to dev@nifi.apache.org by "Peter Wicks (pwicks)" <pw...@micron.com> on 2019/07/31 17:53:29 UTC

RE: [EXT] Re: Duplicate flow files *without* their content

Lars,

If you are worried about it, using ReplaceText will have the same effect as your custom solution. When ReplaceText has its `Replacement Strategy` set to `Always Replace`, it doesn't read the contents of the FlowFile and simply writes out the Replacement Value, which in your case could be an empty string.
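
As a rough sketch, the ReplaceText settings described above might look like the fragment below. `Replacement Strategy` and the replacement value come from the description above; the `Evaluation Mode` line is an assumption on my part and should be checked against the ReplaceText documentation for your NiFi version.

```text
ReplaceText
  Replacement Strategy : Always Replace
  Replacement Value    : (empty string)
  Evaluation Mode      : Entire text      <- assumption, not stated in this thread
```

Placed on the side branch only, this would leave the main branch's FlowFile and its content untouched.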

Thanks,
  Peter

From: Lars Winderling <la...@posteo.de>
Sent: Wednesday, July 31, 2019 11:02 AM
To: dev@nifi.apache.org
Subject: [EXT] Re: Duplicate flow files *without* their content

Hi Edward,

thank you for your input. I didn't know about the copy-on-write semantics; that's really useful. I'll check out the in-depth guide for sure!
In my case, the content of the flow file does change heavily from one processor to the next one, so I doubt copy-on-write would help here.

Best,
Lars

On Wed, 2019-07-31 at 12:13 +0100, Edward Armes wrote:

Hi Lars,

In short, depending on how a FlowFile is duplicated, the content
shouldn't be duplicated as well.

In general, content is only duplicated when it has been deemed to have been
changed (copy-on-write semantics). For the most part (unless a FlowFile has
a large number of attributes) a FlowFile is actually quite small and
therefore the waste is minimal, which is why they can be held in memory and
passed through a Flow.

The best way to branch/clone a flow file is to add another output from the
processor you want to log the output from, and the Framework that surrounds
a Processor will handle the rest. This does create a duplicate FlowFile but
doesn't create a copy of the content. In the provenance repository this is
marked as a CLONE event for the original FlowFile, and the new FlowFile gets
treated as its own unique FlowFile with a reference to the original
content.
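
The copy-on-write behaviour described above can be illustrated with a small toy model in plain Java. This is only a sketch under stated assumptions: `CopyOnWriteDemo`, `ContentClaim`, `ToyFlowFile`, `cloneFlowFile` and `withContent` are hypothetical names invented for illustration and are not NiFi's API. The point is only that a clone copies the small attribute map and shares a reference to the content, and that a new content object appears only when something writes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: every type here is hypothetical and is not part of NiFi.
class CopyOnWriteDemo {

    // Stands in for a (possibly very large) blob in a content repository.
    static final class ContentClaim {
        final byte[] bytes;
        ContentClaim(byte[] bytes) { this.bytes = bytes; }
    }

    static final class ToyFlowFile {
        final Map<String, String> attributes;
        final ContentClaim claim; // a reference to the content, not a copy of it

        ToyFlowFile(Map<String, String> attributes, ContentClaim claim) {
            this.attributes = new HashMap<>(attributes);
            this.claim = claim;
        }

        // Cloning copies the attribute map and shares the content claim.
        ToyFlowFile cloneFlowFile() {
            return new ToyFlowFile(attributes, claim);
        }

        // Writing content creates a new claim; other FlowFiles keep the old one.
        ToyFlowFile withContent(byte[] newBytes) {
            return new ToyFlowFile(attributes, new ContentClaim(newBytes));
        }
    }

    public static void main(String[] args) {
        ToyFlowFile original =
                new ToyFlowFile(Map.of("filename", "big.gz"), new ContentClaim(new byte[1024]));
        ToyFlowFile clone = original.cloneFlowFile();
        System.out.println(clone.claim == original.claim);    // true: content is shared

        ToyFlowFile modified = clone.withContent(new byte[0]);
        System.out.println(modified.claim == original.claim); // false: new claim after a write
        System.out.println(original.claim.bytes.length);      // 1024: original is untouched
    }
}
```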

This is quite a short explanation; a better and more in-depth
explanation can be found here, and I think it covers all the scenarios
you're thinking about:
https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html

Edward



On Wed, Jul 31, 2019 at 11:47 AM Lars Winderling <lars.winderling@posteo.de> wrote:



Dear NiFi community,

I often face the use case where I import flow files with content on the
order of 1 GB or 10 GB, already compressed.
Let's say I need to branch off of a flow where the actual flow file should
be processed further, and on some side branch I just want to do some kind
of logging or similar without accessing the flow file's contents. In that
case it's clearly wasteful to duplicate the flow file including its content.
For this case I wrote a processor defining 2 relationships: "original" and
"attributes only", so the flow file attributes can be accessed separately
from the content.
I will gladly prepare a PR if anyone finds that worth incorporating into
NiFi.
The only remaining question for me would be: use an individual processor to
that end, or add it to e.g. the DuplicateFlowFile processor. The former
seems cleaner to me. A proposed name would be something like ForkProcessor
(no better idea yet).
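
To make the idea of the two relationships concrete, here is a toy sketch in plain Java. It is not the actual processor and does not use NiFi's API: `ForkSketch`, `ToyFlowFile` and `fork` are hypothetical names for illustration only. The relationship names "original" and "attributes only" are the ones described above.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: ForkSketch and its nested type are hypothetical, not NiFi classes.
class ForkSketch {

    static final class ToyFlowFile {
        final Map<String, String> attributes;
        final byte[] content;

        ToyFlowFile(Map<String, String> attributes, byte[] content) {
            this.attributes = new HashMap<>(attributes);
            this.content = content;
        }
    }

    // Routes the untouched input to "original" and a content-free copy
    // carrying the same attributes to "attributes only".
    static Map<String, ToyFlowFile> fork(ToyFlowFile in) {
        Map<String, ToyFlowFile> out = new LinkedHashMap<>();
        out.put("original", in);
        out.put("attributes only", new ToyFlowFile(in.attributes, new byte[0]));
        return out;
    }

    public static void main(String[] args) {
        ToyFlowFile in = new ToyFlowFile(Map.of("filename", "big.gz"), new byte[4096]);
        Map<String, ToyFlowFile> routed = fork(in);
        System.out.println(routed.get("original").content.length);        // 4096
        System.out.println(routed.get("attributes only").content.length); // 0
        System.out.println(routed.get("attributes only").attributes);     // {filename=big.gz}
    }
}
```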

Thanks in advance!
Best,
Lars



Re: [EXT] Re: Duplicate flow files *without* their content

Posted by Lars Winderling <la...@posteo.de>.
Hi Peter,
took me some time to understand your suggestion. Great, thank you!
Have a great day and take care.
Best,
Lars