Posted to dev@nifi.apache.org by Joe Witt <jo...@gmail.com> on 2015/11/03 10:52:56 UTC

Re: Proposal: New file processors: GetFIleData and PutFileData

Rick,

I am finally taking a moment to clear out some dangling threads.  I
just looked into this one and the link appears to be gone.  Have you
chosen to withdraw this proposal at this time?

Thanks
Joe

On Fri, Sep 25, 2015 at 4:25 AM, Rick Braddy <rb...@softnas.com> wrote:
> Yes.  Replication of directory tree via Nifi similar to rsync.
>
> -----Original Message-----
> From: Joe Skora [mailto:jskora@gmail.com]
> Sent: Thursday, September 24, 2015 10:16 PM
> To: dev@nifi.apache.org
> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>
> It may be an oversimplification, but for the purposes of understanding, is the intent to mirror directory tree with NiFi similar to rsync?
>
> On Wed, Sep 23, 2015 at 11:26 PM, Rick Braddy <rb...@softnas.com> wrote:
>
>> Joe,
>>
>> Thanks for the quick response.
>>
>> Yes, I can add to the Wiki once access has been granted. Further responses:
>>
>> >> GetFile and PutFile do support recursive walking/reconstruction
>> >> based on relative paths
>>
>> Based on my recent testing of 0.3.0, GetFile does walk the configured
>> directory tree, picking up the files it finds; however, only files are
>> sent to PutFile, which places them all into a single target folder
>> (not a directory tree - no directory information is sent by GetFile
>> nor processed by PutFile from what I have seen, so I do not believe it
>> reconstructs the directory tree at all today).
>>
>> >> I do think your proposal modified to consider the design pattern of
>> >> ListFile/FetchFile would be super powerful.
>>
>> We have another processor GetFileList that uses "find" to traverse a
>> target folder tree and feeds the resulting newline delimited
>> file/directory stream as FlowFiles into GetFileData.  Perhaps that
>> processor could be evolved into a suitable ListFiles processor.
>>
>> I believe GetFileList/GetFileData correspond roughly to the
>> ListFile/FetchFile concept, based on a cursory review of
>> ListHDFS/FetchHDFS.  If it's a matter of renaming that's obviously
>> trivial at this point.  I'm assuming there are other facets to that
>> List/Fetch design pattern - is it documented anywhere I can review to learn more?
>>
>> So when we have a ListFile/FetchFile what is the corresponding "Put"
>> side of the flow to be?  Perhaps simply PutFile enhanced to handle
>> FlowFiles from both basic GetFile and the richer FetchFile (modified
>> GetFileData) types of FlowFiles and behaviors would suffice.
>>
>> >> Just need to make sure backpressure works through the flow so that
>> >> you could literally handle the delivery of a file which is of itself
>> >> larger than the repo by capturing and sending a chunk of it at a time
>> >> for instance.
>>
>> Agreed. Are there any best practices documented for configuring
>> backpressure properly?
>>
>> Thanks.
>>
>> Rick
>>
>> -----Original Message-----
>> From: Joe Witt [mailto:joe.witt@gmail.com]
>> Sent: Wednesday, September 23, 2015 6:25 PM
>> To: dev@nifi.apache.org
>> Subject: Re: Proposal: New file processors: GetFIleData and
>> PutFileData
>>
>> Rick
>>
>> This is a perfectly fine place to start the thread.  If you'd like to
>> create a wiki feature proposal for it too like we're doing with a lot
>> of the other things at this level we can give you access to create one
>> here [1].
>>
>> Not at all trying to take away from the points you were making but
>> GetFile and PutFile do support recursive walking/reconstruction based
>> on relative paths.  By no means is that as comprehensive as you're
>> going for here though - just an FYI.
>>
>> These sound like good things.  In particular I find your concept for
>> handling arbitrarily large data interesting.  Just need to make sure
>> backpressure works through the flow so that you could literally handle
>> the delivery of a file which is of itself larger than the repo by
>> capturing and sending a chunk of it at a time for instance.  So from a
>> brief historical perspective the GetFile / PutFile processors were
>> literally the first two processors ever built for NiFi back when it
>> had no GUI, no provenance, no nothin' that was cool.  These are the
>> OGs of NiFi.  They have been improved a bit over the years but not much.
>> Why?  Because their utility was largely limited to trivial archiving
>> cases.  We have recently had discussions about making them more
>> powerful through the concept of ListFile/FetchFile like Adam mentions
>> and as we've started doing with things like HDFS.  A much better model
>> for sure.  Still not as powerful as what you're cooking up though.  I
>> do think your proposal modified to consider the design pattern of
>> ListFile/FetchFile would be super powerful.  In your case ListFile for
>> a single larger file for instance could produce N listings that point
>> to the same file on disk but for different offset/ranges.  This would
>> be *very* interesting.  I am a bit concerned about how to have this
>> nicely handle competing consumer problems but...we can cross that bridge later.
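
A minimal sketch of the offset/range idea above, for illustration only. It assumes a hypothetical `RangeListing` record and `list_ranges` helper; neither is an existing NiFi API, and Python is used here purely to show the arithmetic. Each listing points at the same file on disk but covers a different byte range, so a downstream fetch step could read one range at a time.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RangeListing:
    """Hypothetical listing record: one byte range of one file."""
    path: str    # file on disk that every listing points to
    offset: int  # byte offset where this range starts
    length: int  # number of bytes in this range


def list_ranges(path: str, file_size: int, chunk_size: int) -> List[RangeListing]:
    """Split [0, file_size) into consecutive ranges of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    listings = []
    offset = 0
    while offset < file_size:
        # The last range may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        listings.append(RangeListing(path, offset, length))
        offset += length
    return listings
```

Because the ranges are contiguous and non-overlapping, the lengths always sum to the file size. The sketch does not address the competing-consumer concern raised above.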
>>
>> If you're willing to tackle this we can definitely work with you to
>> bring it in.  It is a non-trivial contribution for sure.  Folks often
>> do not consider all the nasty gotchas that can occur in something as
>> seemingly simple as File IO.
>>
>> Thanks
>> Joe
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
>>
>> On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <rb...@softnas.com> wrote:
>> > This thread proposes community review/comments of modified versions
>> > of GetFile and PutFile for potential future adoption by the Nifi
>> > community. For those who want to jump straight to the code, here's the
>> > review repository location for the current version:
>> > https://github.com/rickbraddy/nifishare.
>> >
>> > As background, we needed a way to replicate entire directory trees
>> > of files via Nifi, where multiple directory trees can be specified at
>> > run-time as part of an overall Nifi graph. As Nifi is rooted in
>> > file-based processing, it seems reasonable to continue advancing its
>> > abilities to ingest, process, transform and replicate files in the
>> > most flexible manner possible.  While this proposal is not a be-all
>> > and end-all in that regard, it moves the needle in the right direction
>> > by making file-processing in Nifi more dynamic, enabling flows to
>> > determine how files (and directories) should be processed, which goes
>> > well beyond today's basic file ingress/egress process capabilities
>> > (which certainly have their place and uses).  Whether it's via this
>> > proposal and code or another, clearly Nifi can benefit from this type
>> > of functionality.
>> >
>> > Here's a more detailed explanation of the rationale for developing
>> > these Nifi file processor derivatives and their initial implementation:
>> >
>> > GetFileData
>> > ----------------
>> > The GetFile processor monitors a single directory tree for file
>> > changes and creates FlowFiles for every changed file in that configured
>> > tree. It does a good job of getting files from a configurable folder
>> > that need to be injected into a graph. GetFile falls short of other
>> > requirements that arise for general-purpose file processing:
>> >
>> > - Operates from a single, pre-configured source directory (not
>> >   dynamically configurable at run-time as part of a flow)
>> >
>> > - Scheduled on a periodic basis only, not event-triggered when
>> >   there's something to do
>> >
>> > - Does not support sending an entire directory tree (only files
>> >   are sent, not directories)
>> >
>> > - Is a "source" processor node only; cannot be used within
>> >   other Nifi flow logic that dynamically determines which files or
>> >   directories to get and send as FlowFiles
>> >
>> > - Assumes each file is smaller than the content repository,
>> >   which causes large files (hundreds of MBs, GBs, TBs) to overrun or
>> >   dominate the content repository
>> >
>> > A modified version of GetFile (currently) named GetFileData has been
>> > developed and is proposed as the basis for a new Nifi processor that
>> > will supplement file ingestion with these features:
>> >
>> > - Operates based upon inbound FlowFiles that contain the
>> >   filesystem path to a file or directory
>> >
>> > - Scheduled by incoming FlowFiles containing a file or
>> >   directory path; only runs when there's something to do
>> >
>> > - Supports sending directory tree as a series of directory and
>> >   file paths; e.g., ExecuteProcess("find /mypath -print") =>
>> >   SplitText(newline) => ModifyAttribute(add "file.roodir=/mypath") =>
>> >   GetFileData ...
>> >
>> > - Participates within simple or complex flows to fetch and send
>> >   files and directories
>> >
>> > - (To be developed) Is designed to handle any size file, by
>> >   breaking files larger than a "chunkingThreshold" into a series of
>> >   multiple smaller files that can be reassembled on the other end (by
>> >   PutFileData)
>> >
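
A rough sketch of the "directory tree as a series of directory and file paths" step in the list above, for illustration only. It stands in for the `find /mypath -print` stage using Python's `os.walk`. The `walk_tree` function and the record keys (`root`, `relative_path`, `kind`) are hypothetical names chosen for this example and are not taken from the proposed processors.

```python
import os
from typing import Dict, Iterator


def walk_tree(root_dir: str) -> Iterator[Dict[str, str]]:
    """Yield one record per directory and file under root_dir.

    A directory is always yielded before anything inside it, so a
    receiver can create directories before the files they contain.
    """
    root_dir = os.path.abspath(root_dir)
    for current, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # make the traversal order deterministic
        rel = os.path.relpath(current, root_dir)
        if rel != ".":
            yield {"root": root_dir, "relative_path": rel, "kind": "directory"}
        for name in sorted(filenames):
            rel_file = os.path.normpath(os.path.join(rel, name))
            yield {"root": root_dir, "relative_path": rel_file, "kind": "file"}
```

Carrying the root alongside each relative path is what lets the receiving side rebuild the same layout under a different target directory. Unlike `find`, this sketch does not emit a record for the root directory itself.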
>> > PutFileData
>> > ---------------
>> > The PutFile processor accepts incoming FlowFiles and writes those
>> > files to a single target directory.  It does a good job of handling and
>> > resolving conflicts, but falls short of other requirements that arise
>> > for general-purpose file processing:
>> >
>> > - Does not support directories, only files
>> >
>> > - Only supports a single, preconfigured target directory
>> >
>> > - Cannot reconstruct an entire directory tree based upon
>> >   relative file paths (all files go into a single target directory)
>> >
>> > - Assumes each file is small enough to fit into the content
>> >   repository
>> >
>> > A modified version of PutFile (currently) named PutFileData has been
>> > developed and is proposed as the basis for a new Nifi processor that
>> > will supplement file egress with these features:
>> >
>> > - Supports directories and files
>> >
>> > - Supports reconstruction of entire directory tree based upon
>> >   relative file paths, enabling reconstruction of an entire directory
>> >   tree originating from GetFileData
>> >
>> > - (To be developed) Is designed to handle any size file, by
>> >   reassembling multi-part files into very large files (TBs) that do not
>> >   fit within the content repository
>> >
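
A minimal sketch of the receiving side described in the list above, for illustration only. It assumes hypothetical helpers `resolve_target`, `put_directory`, and `put_chunk`; none of these are existing NiFi or PutFileData APIs. It shows one way relative paths could be mapped under a target root and one way chunks of a large file could be written back at their offsets.

```python
import os


def resolve_target(target_root: str, relative_path: str) -> str:
    """Map relative_path under target_root, rejecting paths that escape it."""
    target_root = os.path.abspath(target_root)
    candidate = os.path.abspath(os.path.join(target_root, relative_path))
    if os.path.commonpath([target_root, candidate]) != target_root:
        raise ValueError(f"path escapes target root: {relative_path}")
    return candidate


def put_directory(target_root: str, relative_path: str) -> str:
    """Recreate a (possibly empty) directory under target_root."""
    path = resolve_target(target_root, relative_path)
    os.makedirs(path, exist_ok=True)
    return path


def put_chunk(target_root: str, relative_path: str, offset: int, data: bytes) -> str:
    """Write one chunk of a file at its byte offset, creating parents as needed."""
    path = resolve_target(target_root, relative_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # "r+b" preserves bytes already written by other chunks; fall back to
    # "wb" only when the file does not exist yet.
    mode = "r+b" if os.path.exists(path) else "wb"
    with open(path, mode) as handle:
        handle.seek(offset)
        handle.write(data)
    return path
```

Writing by offset means chunks may arrive in any order. The sketch ignores conflict resolution, permissions, and concurrent writers, all of which a real processor would need to handle.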
>> > Should the community have an interest in these processors (we can
>> > name them something different, if needed), these contributions are now
>> > available.  In the meantime, we shall continue developing these
>> > processors to meet our specific use cases, adding the chunking
>> > functionality and QA certifying them for production use at scale.
>> >
>> > Looking forward to comments, feedback and recommendations.
>> >
>> > Here's the Github repo link again:
>> > https://github.com/rickbraddy/nifishare
>> >
>> > Best,
>> > Rick
>> >
>> > P.S. If there's a better vehicle for communicating these types of
>> > proposals, please advise.
>> >
>> >
>>

Re: Proposal: New file processors: GetFIleData and PutFileData

Posted by Joe Witt <jo...@gmail.com>.
Sounds good.  For future purposes documentation on the processes which
work best (so far) can be found here:
https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide

Creating a JIRA and attaching a patch in patch reviewed state helps
for sure, as does the Github PR process.  Both create a sort of
'pull' that allows the community to then work on items as available.
There are some List/Fetch items being worked so perhaps some of your
ideas will be addressed there.

Thanks
Joe

On Tue, Nov 3, 2015 at 10:49 AM, Rick Braddy <rb...@softnas.com> wrote:
> There was no interest shown by the community so we moved on.
>

Re: Proposal: New file processors: GetFIleData and PutFileData

Posted by Rick Braddy <rb...@softnas.com>.
There was no interest shown by the community so we moved on.

> On Nov 3, 2015, at 3:53 AM, Joe Witt <jo...@gmail.com> wrote:
> 
> Rick,
> 
> I am finally taking a moment to clear out some dangling threads.  I
> just looked into this one and the link appears to be gone.  Have you
> chosen to withdraw this proposal at this time?
> 
> Thanks
> Joe
> 