Posted to dev@nifi.apache.org by scott <tc...@gmail.com> on 2018/03/27 03:58:29 UTC

ListSFTP incoming relationship

Hello Devs,

I would like to request a feature for a major processor, ListSFTP. But
before I go down the official road, I wanted to ask if anyone thought it
was a terrible idea or impossible, etc. The request is to add support
for an incoming relationship to the ListSFTP processor specifically, but
I could see it added to many of the commonly used head processors, such
as ListFile. I would envision functionality more like InvokeHTTP or
ExecuteSQL, where an incoming flow file could initiate the action, and 
the attributes in the incoming flow file could be used to configure the 
processor actions. It's the configuration aspect that most appeals to 
me, because it opens it up to being centrally or dynamically configured.
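
As a rough illustration of the idea (plain Python, not NiFi code), the sketch below resolves processor-style properties from the attributes of an incoming flow file. The property names, attribute names, and the resolve_properties helper are all hypothetical, and Python's string.Template stands in for NiFi's Expression Language only because both use the ${name} form.

```python
from string import Template
from typing import Dict, Mapping


def resolve_properties(properties: Mapping[str, str],
                       attributes: Mapping[str, str]) -> Dict[str, str]:
    """Substitute ${name} placeholders in each property with flow file attributes.

    Raises KeyError if a property references an attribute that is missing.
    """
    return {name: Template(value).substitute(attributes)
            for name, value in properties.items()}


# Hypothetical processor configuration that refers to incoming attributes.
properties = {
    "Hostname": "${sftp_host}",
    "Remote Path": "/exports/${feed_name}/incoming",
}

# Hypothetical attributes, e.g. loaded earlier in the flow from a config table.
attributes = {"sftp_host": "sftp.example.com", "feed_name": "orders"}

resolved = resolve_properties(properties, attributes)
```

With this shape, changing a row in the central configuration table changes what the listing step does, without editing the processor itself.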

Thanks,

Scott

Re: ListSFTP incoming relationship

Posted by scott <tc...@gmail.com>.

Pierre,

That sounds good. I'll work on the requirements and create a Jira this 
week, so that I can get started.

Thanks to all for your feedback.


Scott


On 04/01/2018 10:06 AM, Pierre Villard wrote:
> Hi Scott,
>
> In my opinion, based on the discussion here, I'd suggest you implement
> the solution that you think best answers your needs, also taking into
> consideration all the feedback the community provided. Once you have
> something, it is best to submit a pull request so that review and discussion
> can move forward on the implementation itself. I'd also recommend filing a
> JIRA with as much detail as possible on what the need is, what the
> options on the table are, and what implementation you want to propose
> (the more technical details you give, the sooner you'll get feedback on
> your code).
>
> Pierre
>
>
>
> 2018-04-01 18:40 GMT+02:00 scott <tc...@gmail.com>:
>
>> Okay. I guess I didn't realize how NiFi dev felt about risk tolerance. I
>> think I can work around it by adding duplicate filtering or implementing some
>> other state management solution.
>> So, what's the next step?
>>
>> Scott
>>
>> On Thu, Mar 29, 2018, 10:46 AM Bryan Bende <bb...@gmail.com> wrote:
>>
>>> Scott,
>>>
>>> You are correct that the overall discussion is about allowing incoming
>>> flow files to ListSFTP.
>>>
>>> However, the previous discussion on this thread highlighted that the
>>> main reason ListSFTP currently doesn't allow incoming flow files is
>>> because of challenges when storing state.
>>>
>>> This led to the proposal of a new processor that allowed incoming flow
>>> files, but did not store state, thus avoiding the challenges mentioned
>>> above. If we were going to store state in this new processor, then
>>> we'd be back to the exact same challenges.
>>>
>>> Providing an option to turn on state also doesn't really help, because
>>> if there is an option provided to users, then the option will be used,
>>> and it needs to work when it is used.
>>>
>>> If we can come up with something that stores state and works well for
>>> all scenarios, then we aren't against it, we just need to handle the
>>> challenges highlighted by Joe's original email.
>>>
>>> Regarding some of the other ideas...
>>>
>>> The current output of ListSFTP already includes flow file attributes
>>> for each listing that include the full path, filename, last update
>>> time, owner, group, permissions, and file size.... were you thinking
>>> of something different than that?
>>>
>>> See the "Writes Attributes" section here:
>>>
>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html
>>>
>>> Thanks,
>>>
>>> Bryan
>>>
>>>
>>>
>>> On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <al...@apache.org> wrote:
>>>> Scott,
>>>>
>>>> I think there are two conversations going on here. You are finding the
>>>> requirements for your specific use case, and that’s great. But I echo
>>>> Bryan’s point that a community processor for this scenario should not
>>>> store state at all. Sivaprasanna’s point that given dynamic directory
>>>> input, storing state based on that can cause massive data ingestion
>>>> problems still stands.
>>>>
>>>> For your specific use case, you can prototype (or possibly even get to
>>>> a stable and robust-enough point) using ExecuteScript to model the
>>>> behavior you need.
>>>>
>>>> In regards to the desired output format, I would suggest a few items:
>>>>
>>>> * Avro requires a schema to be defined, and this raises the barrier to
>>>> use of the processor. Also, unless being sent to a processor that
>>>> understands Avro, the result will need to be converted anyway using
>>>> Record* processors.
>>>> * If the output is individual flowfiles on a 1:1 basis, each should
>>>> have as many attributes populated with the parsed information as
>>>> possible (i.e. file.name, file.path, file.size, file.owner,
>>>> file.permissions, etc.). This allows for easily-consumable and
>>>> routable flowfiles.
>>>> * If the output is a full directory listing, I would suggest `ls -al`
>>>> type raw text output, or JSON (arbitrary human-readable and
>>>> machine-readable format with many consuming/transforming processors).
>>>>
>>>>
>>>> Andy LoPresto
>>>> alopresto@apache.org
>>>> alopresto.apache@gmail.com
>>>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>>>>
>>>> On Mar 29, 2018, at 9:34 AM, scott <tc...@gmail.com> wrote:
>>>>
>>>> Sorry Bryan, but I disagree with you. Not storing state is NOT the main
>>>> point of this new processor. The main point is to allow an incoming
>>>> relationship flowfile to trigger the action, and allow variables to be
>>>> used from the attributes therein.
>>>>
>>>> I agree that if the NiFi community deems it too risky to distribute
>>>> this processor with state keeping optionally available, even if the
>>>> default is to disable it, then so be it. If state is not included
>>>> optionally, then how about making the output flowfile content include
>>>> more than just the file names? Have it include last updated time along
>>>> with the filename. If it searches recursively, you'll want to include
>>>> the path to the file also. Maybe it would be best to output the results
>>>> into a structured format, such as Avro? Or, maybe it would just be best
>>>> to output one flowfile per remote file found, and include updated time
>>>> and fully qualified path as attributes?
>>>>
>>>> Scott
>>>>
>>>>
>>>> On 03/29/2018 04:32 AM, Bryan Bende wrote:
>>>>
>>>> The main point of the new processor is to NOT store state so that it
>>>> becomes more reasonable to allow incoming flow files.
>>>>
>>>> You could probably implement your own custom processor that does both
>>>> because you can make assumptions about how you are going to use it, but
>>>> if the NiFi community provides one then it needs to work well for all
>>>> situations, such as dynamically listing hundreds of directories, which
>>>> is problematic when state is involved.
>>>>
>>>> On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <sivaprasanna246@gmail.com>
>>>> wrote:
>>>>
>>>> Do we really need optional state-saving functionality? If the user is
>>>> unaware of the implications and proceeds to store the state, then
>>>> what Andrew Grande mentioned will happen: the possibility of a
>>>> never-ending stream of state information being stored. If we still go
>>>> with the optional state management approach, the documentation has to
>>>> be clear in explaining the implications.
>>>>
>>>> Sivaprasanna
>>>>
>>>> On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:
>>>>
>>>> Okay. So, a new processor called "ScanSFTP" that allows an incoming
>>>> relationship, where the content of the flow file is replaced with the
>>>> list of matching files from the remote directory, and the list is then
>>>> filtered by the usual regex parameters like today. Optional state
>>>> information is kept to additionally filter out files older than the
>>>> newest file observed during the last run. Does that sound okay to
>>>> everyone? If so, what's the next step?
>>>>
>>>> Scott
>>>>
>>>>
>>>> On 03/27/2018 06:21 PM, scott wrote:
>>>>
>>>> This is a great discussion, and I appreciate the interest in my problem.
>>>> I think there are workarounds if you decide not to store state, but
>>>> I'd recommend keeping it. I think state should be kept optionally,
>>>> even turned off by default. Several times I've had issues where the
>>>> state has caused me to miss files, because files get moved into the
>>>> source folder out of order, and I've wished I could turn the state
>>>> feature off.
>>>>
>>>> In my current use-case, I would not be frequently, dynamically
>>>> changing the source directory, though I can see the use-cases where it
>>>> would be. In my current use-case, I want to use an external database
>>>> table to control the configuration of all my flows. I do this by first
>>>> reading the content of the table for this particular flow ID, then
>>>> assign the result as attributes to the flowfile, essentially creating
>>>> variables I can use throughout the flow to control its behavior. This
>>>> works great with flows that initiate with HTTP or SQL, but not
>>>> ListSFTP or ListFile.
>>>>
>>>> Scott
>>>>
>>>>
>>>> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>>>>
>>>> I think Bryan’s point is a good one and when I first saw this
>>>> question (and thought of the previous times it’s been asked), my
>>>> initial response was to propose a second processor.
>>>>
>>>> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
>>>> differently from ListSFTP — it does not maintain state, and performs
>>>> a one-time tabulation/chronicling of the state of that directory at
>>>> the given point in time.
>>>>
>>>> The responsibility to maintain and compare state across time is no
>>>> longer a requirement. There could even be a setting in the processor
>>>> to allow for “individual flowfile output” (i.e. act the same as
>>>> ListSFTP and output one flowfile per item listed) or “summary
>>>> flowfile output” where a single flowfile is generated containing the
>>>> directory listing information for all the items there. (Another
>>>> option is to output both on two different relationships).
>>>>
>>>> I think this would enable the types of workflows that users have
>>>> asked about in the past without compromising the mechanism by which
>>>> List* processors work and adding undue complexity to those processors.
>>>>
>>>> Absolutely crystal clear documentation (and a standard verb for the
>>>> new processor family) would be necessary (not only because these
>>>> processors solve different problems, but to avoid a million variants
>>>> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
>>>> provide a directory in an attribute to ListSFTP” mailing list
>>>> questions).
>>>>
>>>>
>>>> Andy LoPresto
>>>> alopresto@apache.org
>>>> alopresto.apache@gmail.com
>>>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>>>>
>>>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com> wrote:
>>>>
>>>> The key here is that a ListXXX processor maintains state. A directory
>>>> is part of such state. Allowing arbitrary directories via an expression
>>>> would create a never-ending stream of new entries in the state storage,
>>>> effectively engineering a distributed DoS attack on the NiFi node or
>>>> shared ZK quorum (for when state is stored in there).
>>>>
>>>> Maybe if we focus on thinking about assumptions and restrictions the
>>>> processor should make to contain that risk...
>>>>
>>>> Andrew
>>>>
>>>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com> wrote:
>>>>
>>>> I'm not sure that would solve the problem because you'd still be
>>>> limited to one directory. What most people are asking for is the
>>>> ability to use a dynamic directory from an incoming flow file.
>>>>
>>>> I think we might be trying to fit two different use-cases into one
>>>> processor which might not make sense.
>>>>
>>>> Scenario #1... There is a directory that is constantly receiving new
>>>> data and has a significant amount of files, and I want to periodically
>>>> find new files. This is what the current processors are optimized for.
>>>>
>>>> Scenario #2... There is a directory that is mostly static with a
>>>> moderate/small number of files, and at points in my flow I want to
>>>> dynamically perform a listing of this directory and retrieve the
>>>> files. This is more geared towards the mentality of running a
>>>> job/workflow.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <ottobackwards@gmail.com>
>>>> wrote:
>>>>
>>>> What if the changes were ‘on top of’ some base set of properties, like
>>>> directory? Like a filter, which, if present in the incoming file, will
>>>> have the LIST* list only things that match a name or attribute?
>>>>
>>>>
>>>>
>>>> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com) wrote:
>>>>
>>>> Scott
>>>>
>>>> This idea has come up a couple of times and there is definitely
>>>> something intriguing to it. Where I think this idea stalls out though
>>>> is in implementation.
>>>>
>>>> While I agree that the other List* processors might similarly benefit,
>>>> let's focus on ListFile. Today you tell ListFile what directory to
>>>> start looking for files in. It goes off scanning that directory for
>>>> hits and stores state about what it has already searched/seen. And it
>>>> is important to keep track of how much it has already scanned because
>>>> at times the search directory can be massive (100,000s or more files
>>>> and directories to scan, for example).
>>>>
>>>> In the proposed model the directory to be scanned could be provided
>>>> dynamically by looking at an attribute of an incoming flowfile (or
>>>> other criteria can be provided - not just the directory to scan). In
>>>> this case the ListFile processor goes on scanning against that now.
>>>> What about the previous directory (or directories) it was told to
>>>> scan? Does it still track those too? What if it starts scanning the
>>>> newly provided directory, hasn't finished pulling all the data or new
>>>> data is continually arriving, and it is told to switch to another
>>>> directory?
>>>>
>>>> I think if those questions can get solid answers and someone invests
>>>> time in creating a PR then this could be pretty powerful. Would be
>>>> good to see a written description of the use case(s) for this too.
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com> wrote:
>>>>
>>>> Hello Devs,
>>>>
>>>> I would like to request a feature for a major processor, ListSFTP. But
>>>> before I go down the official road, I wanted to ask if anyone thought
>>>> it was a terrible idea or impossible, etc. The request is to add
>>>> support for an incoming relationship to the ListSFTP processor
>>>> specifically, but I could see it added to many of the commonly used
>>>> head processors, such as ListFile. I would envision functionality more
>>>> like InvokeHTTP or ExecuteSQL, where an incoming flow file could
>>>> initiate the action, and the attributes in the incoming flow file could
>>>> be used to configure the processor actions. It's the configuration
>>>> aspect that most appeals to me, because it opens it up to being
>>>> centrally or dynamically configured.
>>>>
>>>> Thanks,
>>>>
>>>> Scott
>>>>
>>>>
>>>>
>>>>

Re: ListSFTP incoming relationship

Posted by Joey Frazee <jo...@icloud.com>.

I worked on this at one point to make it "easier" (haha...) to process a very deep directory tree with 100k+ files -- idea was to break it up into subtrees for concurrency, etc. by looping the outgoing relationship back to input.

It ended up being painful. Looking at the diff again, the most annoying things were:

- The list processors extend AbstractListProcessor which made the change invasive. It felt dirty to alter the already existing incoming attrs for the purpose of leaving performListing alone so it required interface and implementation changes across other processors.

- The state problem already mentioned was a downer. The size of the state store wasn't a problem in practice, assuming a trusted client and some vaguely small number of dirs. What was more annoying is you have to move to directory name keys and have a state migration hook to preserve compatibility. Right now there are two values stored so it's N*2 (at the time I convinced myself those were redundant but there are a lot of edge cases that have been patched over time so dunno for sure).
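
To make the N*2 point concrete, here is a minimal Python sketch, not NiFi code. It assumes a state map holding two timestamp values per listing; the key names, the "|" separator, and the migrate_state and record_listing helpers are hypothetical and chosen only for illustration.

```python
from typing import Dict

# Hypothetical names for the two values stored per listing.
LISTED_KEY = "listing.timestamp"
PROCESSED_KEY = "processed.timestamp"


def migrate_state(old: Dict[str, str], directory: str) -> Dict[str, str]:
    """Migration hook: re-key a single-directory state map under its directory."""
    return {f"{directory}|{key}": value
            for key, value in old.items()
            if key in (LISTED_KEY, PROCESSED_KEY)}


def record_listing(state: Dict[str, str], directory: str,
                   listed_ts: int, processed_ts: int) -> None:
    """Store the two timestamps for one directory (two entries per directory)."""
    state[f"{directory}|{LISTED_KEY}"] = str(listed_ts)
    state[f"{directory}|{PROCESSED_KEY}"] = str(processed_ts)


# Existing state from a processor configured with one fixed directory.
state = migrate_state({LISTED_KEY: "100", PROCESSED_KEY: "90"}, "/data/in")

# Each dynamically supplied directory adds two more entries.
for n in range(3):
    record_listing(state, f"/data/dyn{n}", 200 + n, 200 + n)

entry_count = len(state)  # 4 directories * 2 values
```

The state grows linearly with the number of distinct directories seen, which is harmless for a handful of trusted inputs and unbounded if directories come from arbitrary attributes.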

Having tried it I am on team ScanX for a second stateless processor.

If state or a last modified guard of some kind is needed, it can be implemented at the flow level in a number of ways (DMC, LookupService, DetectDupe, etc.). This isn't possible with ListX because it doesn't take input so embedding the last modified filter in those is way more of a necessity.
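
A minimal sketch of such a flow-level guard, in plain Python rather than NiFi code: a stateless listing is passed through a filter that remembers the last modified time per path. The RemoteFile type and filter_new_files helper are hypothetical, and the dict stands in for whatever cache or lookup service the flow would actually use.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class RemoteFile:
    path: str           # fully qualified remote path
    last_modified: int  # epoch milliseconds


def filter_new_files(listing: Iterable[RemoteFile],
                     seen: Dict[str, int]) -> List[RemoteFile]:
    """Return entries that are new or modified since last seen, updating `seen`."""
    fresh = []
    for f in listing:
        previous = seen.get(f.path)
        if previous is None or f.last_modified > previous:
            fresh.append(f)
            seen[f.path] = f.last_modified
    return fresh


# Hypothetical stand-in for an external cache shared across runs.
cache: Dict[str, int] = {}

first = [RemoteFile("/in/a.csv", 1000), RemoteFile("/in/b.csv", 2000)]
# c.csv shows up later but carries an older timestamp than b.csv.
second = first + [RemoteFile("/in/c.csv", 1500)]

run1 = filter_new_files(first, cache)
run2 = filter_new_files(second, cache)
```

Because the guard is keyed by path instead of a single newest-timestamp watermark, a file that arrives out of order (c.csv here) is still emitted on the second run.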

-joey

On Apr 1, 2018, 12:06 PM -0500, Pierre Villard <pi...@gmail.com>, wrote:
> Hi Scott,
>
> In my opinion, based on the discussion here, I'd suggest you to implement
> the solution that you seem best to answer your needs and also taking in
> consideration all the feedback the community provided. Once you have
> something, best is to submit a pull request so that review and discussion
> can move forward on the implementation itself. I'd also recommend to file a
> JIRA with as much details as possible on what is the need, what are the
> options on the table and what is the implementation you want to propose
> (the more technical details you give, the sooner you'll get feedback for
> your code).
>
> Pierre
>
>
>
> 2018-04-01 18:40 GMT+02:00 scott <tc...@gmail.com>:
>
> > Okay. I guess I didn't realize how Nifi dev felt about risk tolerance. I
> > think I can work around it by adding duplicate filtering or implement some
> > other state management solution.
> > So, what's the next step?
> >
> > Scott
> >
> > On Thu, Mar 29, 2018, 10:46 AM Bryan Bende <bb...@gmail.com> wrote:
> >
> > > Scott,
> > >
> > > You are correct that the overall discussion is about allowing incoming
> > > flow files to ListSFTP.
> > >
> > > However, the previous discussion on this thread highlighted that the
> > > main reason ListSFTP currently doesn't allow incoming flow files is
> > > because of challenges when storing state.
> > >
> > > This led to the proposal of a new processor that allowed incoming flow
> > > files, but did not store state, thus avoiding the challenges mentioned
> > > above. If we were going to store state in this new processor, then
> > > we'd be back to the exact same challenges.
> > >
> > > Providing an option to turn on state also doesn't really help, because
> > > if there is an option provided to users,then the option will be used,
> > > and it needs to work when it is used.
> > >
> > > If we can come up with something that stores state and works well for
> > > all scenarios, then we aren't against it, we just need to handle the
> > > challenges highlighted by Joe's original email.
> > >
> > > Regarding some of the other ideas...
> > >
> > > The current output of ListSFTP already includes flow file attributes
> > > for each listing that include the full path, filename, last update
> > > time, owner, group, permissions, and file size.... were you thinking
> > > of something different than that?
> > >
> > > See the "Writes Attributes" section here:
> > >
> > > https://nifi.apache.org/docs/nifi-docs/components/org.
> > apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.
> > processors.standard.ListSFTP/index.html
> > >
> > > Thanks,
> > >
> > > Bryan
> > >
> > >
> > >
> > > On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <alopresto@apache.org
> > > wrote:
> > > > Scott,
> > > >
> > > > I think there are two conversations going on here. You are finding the
> > > > requirements for your specific use case, and that’s great. But I echo
> > > > Bryan’s point that a community processor for this scenario should not
> > > store
> > > > state at all. Sivaprasanna’s point that given dynamic directory input,
> > > > storing state based on that can cause massive data ingestion problems
> > > still
> > > > stands.
> > > >
> > > > For your specific use case, you can prototype (or possibly even get to
> > a
> > > > stable and robust-enough point) using ExecuteScript to model the
> > behavior
> > > > you need.
> > > >
> > > > In regards to the desired output format, I would suggest a few items:
> > > >
> > > > * Avro requires a schema to be defined, and this raises the barrier to
> > > use
> > > > of the processor. Also, unless being sent to a processor that
> > understands
> > > > Avro, the result will need to be converted anyway using Record*
> > > processors.
> > > > * If the output is individual flowfiles on a 1:1 basis, each should
> > have
> > > as
> > > > many attributes populated with the parsed information as possible (i.e.
> > > > file.name, file.path, file.size, file.owner, file.permissions, etc.).
> > > This
> > > > allows for easily-consumable and routable flowfiles.
> > > > * If the output is a full directory listing, I would suggest `ls -al`
> > > type
> > > > raw text output, or JSON (arbitrary human-readable and machine-readable
> > > > format with many consuming/transforming processors).
> > > >
> > > >
> > > > Andy LoPresto
> > > > alopresto@apache.org
> > > > alopresto.apache@gmail.com
> > > > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
> > > >
> > > > On Mar 29, 2018, at 9:34 AM, scott <tc...@gmail.com> wrote:
> > > >
> > > > Sorry Bryan, but I disagree with you. Not storing state is NOT the main
> > > > point of this new processor. The main point is to allow an incoming
> > > > relationship flowfile to trigger the action, and allow variables to be
> > > used
> > > > from the attributes therein.
> > > >
> > > > I agree that if the NiFi community deems it too risky to distribute
> > this
> > > > processor with state keeping optionally available, even if the default
> > > is to
> > > > disable it, then so be it. If state is not included optionally, then
> > how
> > > > about making the output flowfile content include more than just the
> > file
> > > > names? Have it include last updated time along with the filename. If it
> > > > searches recursively, you'll want to include the path to the file also.
> > > > Maybe it would be best to output the results into a structured format,
> > > such
> > > > as AVRO? Or, maybe it would just be best to output one flowfile per
> > > remote
> > > > file found, and include updated time and fully qualified path as
> > > attributes?
> > > >
> > > > Scott
> > > >
> > > >
> > > > On 03/29/2018 04:32 AM, Bryan Bende wrote:
> > > >
> > > > The main point of the new processor is to NOT store state so that it
> > > > becomes more reasonable to allow incoming flow files.
> > > >
> > > > You could probably implement your own custom processor that does both
> > > > because you can make assumptions about how you are going to use it, but
> > > if
> > > > the NiFi community provides one then it needs to work well for all
> > > > situations, such as dynamically listing hundreds of directories, which
> > is
> > > > problematic when state is involved.
> > > >
> > > > On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <
> > sivaprasanna246@gmail.com
> > > > wrote:
> > > >
> > > > Should we really have to have an optional state saving functionality?
> > If
> > > > the user is unaware of the implications and proceed to store the state
> > > then
> > > > what Andrew Grande mentioned will happen - possibilities of never
> > ending
> > > > stream of state information being stored. If we still go with the
> > > optional
> > > > state management approach, documentation have to be clear in explaining
> > > the
> > > > implications.
> > > >
> > > > Sivaprasanna
> > > >
> > > > On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:
> > > >
> > > > Okay. So, a new processor called "ScanSFTP", allow incoming
> > relationship
> > > > where the content of the flow file is replaced with the list of
> > matching
> > > > files from the remote directory, then the list is filtered by the usual
> > > > regex parameters like today. Optional state information is kept to
> > > > additionally filter the list of files older than the newest file
> > > > observed during the last run. Does that sound okay to everyone? If so,
> > > > what's the next step?
> > > >
> > > > Scott
> > > >
> > > >
> > > > On 03/27/2018 06:21 PM, scott wrote:
> > > >
> > > > This is a great discussion, and appreciate the interest in my problem.
> > > > I think there are workarounds if you decide not to store state, but
> > > > I'd recommend keeping it. I think state should be kept optionally,
> > > > even turned off by default. Several times I've had issues where the
> > > > state has cause me to miss files, because files get moved into the
> > > > source folder out of order, and I've wished I could turn the state
> > > > feature off.
> > > >
> > > > In my current use-case, I would not be frequently, dynamically
> > > > changing the source directory, though I can see the use-cases where it
> > > > would be. In my current use-case, I want to use an external database
> > > > table to control the configuration of all my flows. I do this by first
> > > > reading the content of the table for this particular flow ID, then
> > > > assign the result as attributes to the flowfile, essentially creating
> > > > variables I can use throughout the flow to control its behavior. This
> > > > works great with flows that initiate with HTTP or SQL, but not
> > > > ListSFTP or ListFile.
> > > >
> > > > Scott
> > > >
> > > >
> > > > On 03/27/2018 02:05 PM, Andy LoPresto wrote:
> > > >
> > > > I think Bryan’s point is a good one and when I first saw this
> > > > question (and thought of the previous times it’s been asked), my
> > > > initial response is to propose a second processor.
> > > >
> > > > Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
> > > > differently from ListSFTP — it does not maintain state, and performs
> > > > a one-time tabulation/chronicling of the state of that directory at
> > > > the given point in time.
> > > >
> > > > The responsibility to maintain and compare state across time is no
> > > > longer a requirement. There could even be a setting in the processor
> > > > to allow for “individual flowfile output” (i.e. act the same as
> > > > ListSFTP and output one flowfile per item listed) or “summary
> > > > flowfile output” where a single flowfile is generated containing the
> > > > directory listing information for all the items there. (Another
> > > > option is to output both on two different relationships).
> > > >
> > > > I think this would enable the types of workflows that users have
> > > > asked about in the past without compromising the mechanism by which
> > > > List* processors work and adding undue complexity to those processors.
> > > >
> > > > Absolutely crystal clear documentation (and a standard verb for the
> > > > new processor family) would be necessary (not only because these
> > > > processor solve different problems, but to avoid a million variants
> > > > of “I used ScanSFTP processor and it’s not tracking state”/“How do I
> > > > provide a directory in an attribute to ListSFTP” mailing list
> > > > questions).
> > > >
> > > >
> > > > Andy LoPresto
> > > > alopresto@apache.org <mailto:alopresto@apache.org
> > > > /alopresto.apache@gmail.com <ma...@gmail.com>/
> > > > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
> > > >
> > > > On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com
> > > > <ma...@gmail.com>> wrote:
> > > >
> > > > The key here is that ListXXX processor maintains state. A directory
> > > > is part
> > > > of such state. Allowing arbitrary directories via an expression would
> > > > create never ending stream of new entries in the state storage,
> > > > effectively
> > > > engineering a distributed DoS attack on the NiFi node or shared ZK
> > > > quorum
> > > > (for when state is stored in there).
> > > >
> > > > Maybe if we focus on thinking about assumptions and restrictions the
> > > > processor should make to contain that risk...
> > > >
> > > > Andrew
> > > >
> > > > On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com
> > > > <ma...@gmail.com>> wrote:
> > > >
> > > > I'm not sure that would solve the problem because you'd still be
> > > > limited to one directory. What most people are asking for is the
> > > > ability to use a dynamic directory from an incoming flow file.
> > > >
> > > > I think we might be trying to fit two different use-cases into one
> > > > processor which might not make sense.
> > > >
> > > > Scenario #1... There is a directory that is constantly receiving new
> > > > data and has a significant amount of files, and I want to
> > > >
> > > > periodically
> > > >
> > > > find new files. This is what the current processors are optimized
> > > >
> > > > for.
> > > >
> > > > Scenario #2... There is a directory that is mostly static with a
> > > > moderate/small number of files, and at points in my flow I want to
> > > > dynamically perform a listing of this directory and retrieve the
> > > > files. This is more geared towards the mentality of running a
> > > > job/workflow.
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler
> > > > <ottobackwards@gmail.com <mailto:ottobackwards@gmail.com
> > > > wrote:
> > > >
> > > > What if the changes where ‘on top of’ some base set of properties,
> > > > like
> > > > directory?
> > > > Like a filter, where if present from the incoming file will have
> > > >
> > > > the
> > > >
> > > > LIST*
> > > >
> > > > list only things
> > > > that match a name or attribute?
> > > >
> > > >
> > > >
> > > > On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com
> > > > <ma...@gmail.com>) wrote:
> > > >
> > > > Scott
> > > >
> > > > This idea has come up a couple of times and there is definitely
> > > > something intriguing to it. Where I think this idea stalls out
> > > > though is in implementation.
> > > >
> > > > While I agree that the other List* processors might similarly
> > > > benefit, let's focus on ListFile. Today you tell ListFile what
> > > > directory to start looking for files in. It goes off scanning that
> > > > directory for hits and stores state about what it has already
> > > > searched/seen. And it is important to keep track of how much it
> > > > has already scanned because at times the search directory can be
> > > > massive (hundreds of thousands or more files and directories to
> > > > scan, for example).
> > > >
> > > > In the proposed model the directory to be scanned could be provided
> > > > dynamically by looking at an attribute of an incoming flowfile (or
> > > > other criteria can be provided - not just the directory to scan).
> > > > In this case the ListFile processor goes on scanning against that
> > > > now. What about the previous directory (or directories) it was told
> > > > to scan? Does it still track those too? What if it starts scanning
> > > > the newly provided directory, hasn't finished pulling all the data
> > > > or new data is continually arriving, and it is told to switch to
> > > > another directory?
> > > >
> > > > I think if those questions can get solid answers and someone
> > > > invests time in creating a PR then this could be pretty powerful.
> > > > Would be good to see a written description of the use case(s) for
> > > > this too.
> > > >
> > > > Thanks
> > > > Joe
> > > >
> > > > On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com
> > > > <ma...@gmail.com>> wrote:
> > > >
> > > > Hello Devs,
> > > >
> > > > I would like to request a feature to a major processor, ListSFTP.
> > > > But before I go down the official road, I wanted to ask if anyone
> > > > thought it was a terrible idea or impossible, etc. The request is
> > > > to add support for an incoming relationship to the ListSFTP
> > > > processor specifically, but I could see it added to many of the
> > > > commonly used head processes, such as ListFile. I would envision
> > > > functionality more like InvokeHTTP or ExecuteSQL, where an
> > > > incoming flow file could initiate the action, and the attributes
> > > > in the incoming flow file could be used to configure the processor
> > > > actions. It's the configuration aspect that most appeals to me,
> > > > because it opens it up to being centrally or dynamically
> > > > configured.
> > > >
> > > > Thanks,
> > > >
> > > > Scott
> > > >
> > > >
> > > >
> > > >
> > >
> >

Re: ListSFTP incoming relationship

Posted by Pierre Villard <pi...@gmail.com>.

Hi Scott,

In my opinion, based on the discussion here, I'd suggest you implement
the solution that seems best to answer your needs while also taking into
consideration all the feedback the community provided. Once you have
something, it is best to submit a pull request so that review and discussion
can move forward on the implementation itself. I'd also recommend filing a
JIRA with as many details as possible on what the need is, what options are
on the table, and what implementation you want to propose
(the more technical details you give, the sooner you'll get feedback on
your code).

Pierre



2018-04-01 18:40 GMT+02:00 scott <tc...@gmail.com>:

> Okay. I guess I didn't realize how NiFi dev felt about risk tolerance. I
> think I can work around it by adding duplicate filtering or implementing some
> other state management solution.
> So, what's the next step?
>
> Scott
>
> On Thu, Mar 29, 2018, 10:46 AM Bryan Bende <bb...@gmail.com> wrote:
>
> > Scott,
> >
> > You are correct that the overall discussion is about allowing incoming
> > flow files to ListSFTP.
> >
> > However, the previous discussion on this thread highlighted that the
> > main reason ListSFTP currently doesn't allow incoming flow files is
> > because of challenges when storing state.
> >
> > This led to the proposal of a new processor that allowed incoming flow
> > files, but did not store state, thus avoiding the challenges mentioned
> > above. If we were going to store state in this new processor, then
> > we'd be back to the exact same challenges.
> >
> > Providing an option to turn on state also doesn't really help, because
> > if there is an option provided to users, then the option will be used,
> > and it needs to work when it is used.
> >
> > If we can come up with something that stores state and works well for
> > all scenarios, then we aren't against it, we just need to handle the
> > challenges highlighted by Joe's original email.
> >
> > Regarding some of the other ideas...
> >
> > The current output of ListSFTP already includes flow file attributes
> > for each listing that include the full path, filename, last update
> > time, owner, group, permissions, and file size.... were you thinking
> > of something different than that?
> >
> > See the "Writes Attributes" section here:
> >
> > https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html
> >
> > Thanks,
> >
> > Bryan
> >
> >
> >
> > On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <al...@apache.org>
> > wrote:
> > > Scott,
> > >
> > > I think there are two conversations going on here. You are finding the
> > > requirements for your specific use case, and that’s great. But I echo
> > > Bryan’s point that a community processor for this scenario should not
> > > store state at all. Sivaprasanna’s point that given dynamic directory
> > > input, storing state based on that can cause massive data ingestion
> > > problems still stands.
> > >
> > > For your specific use case, you can prototype (or possibly even get to
> > > a stable and robust-enough point) using ExecuteScript to model the
> > > behavior you need.
> > >
> > > In regards to the desired output format, I would suggest a few items:
> > >
> > > * Avro requires a schema to be defined, and this raises the barrier to
> > > use of the processor. Also, unless being sent to a processor that
> > > understands Avro, the result will need to be converted anyway using
> > > Record* processors.
> > > * If the output is individual flowfiles on a 1:1 basis, each should
> > > have as many attributes populated with the parsed information as
> > > possible (i.e. file.name, file.path, file.size, file.owner,
> > > file.permissions, etc.). This allows for easily-consumable and
> > > routable flowfiles.
> > > * If the output is a full directory listing, I would suggest `ls -al`
> > > type raw text output, or JSON (arbitrary human-readable and
> > > machine-readable format with many consuming/transforming processors).
> > >
> > >
> > > Andy LoPresto
> > > alopresto@apache.org
> > > alopresto.apache@gmail.com
> > > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> > >
> > > On Mar 29, 2018, at 9:34 AM, scott <tc...@gmail.com> wrote:
> > >
> > > Sorry Bryan, but I disagree with you. Not storing state is NOT the main
> > > point of this new processor. The main point is to allow an incoming
> > > relationship flowfile to trigger the action, and allow variables to be
> > > used from the attributes therein.
> > >
> > > I agree that if the NiFi community deems it too risky to distribute
> > > this processor with state keeping optionally available, even if the
> > > default is to disable it, then so be it. If state is not included
> > > optionally, then how about making the output flowfile content include
> > > more than just the file names? Have it include last updated time along
> > > with the filename. If it searches recursively, you'll want to include
> > > the path to the file also. Maybe it would be best to output the results
> > > into a structured format, such as AVRO? Or, maybe it would just be best
> > > to output one flowfile per remote file found, and include updated time
> > > and fully qualified path as attributes?
> > >
> > > Scott
> > >
> > >
> > > On 03/29/2018 04:32 AM, Bryan Bende wrote:
> > >
> > > The main point of the new processor is to NOT store state so that it
> > > becomes more reasonable to allow incoming flow files.
> > >
> > > You could probably implement your own custom processor that does both
> > > because you can make assumptions about how you are going to use it, but
> > > if the NiFi community provides one then it needs to work well for all
> > > situations, such as dynamically listing hundreds of directories, which
> > > is problematic when state is involved.
> > >
> > > On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <sivaprasanna246@gmail.com>
> > > wrote:
> > >
> > > Should we really have to have an optional state saving functionality?
> > > If the user is unaware of the implications and proceeds to store the
> > > state then what Andrew Grande mentioned will happen - possibilities of
> > > a never ending stream of state information being stored. If we still go
> > > with the optional state management approach, documentation has to be
> > > clear in explaining the implications.
> > >
> > > Sivaprasanna
> > >
> > > On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:
> > >
> > > Okay. So, a new processor called "ScanSFTP", allow incoming
> > > relationship where the content of the flow file is replaced with the
> > > list of matching files from the remote directory, then the list is
> > > filtered by the usual regex parameters like today. Optional state
> > > information is kept to additionally filter the list of files older
> > > than the newest file observed during the last run. Does that sound
> > > okay to everyone? If so, what's the next step?
> > >
> > > Scott
> > >
> > >
> > > On 03/27/2018 06:21 PM, scott wrote:
> > >
> > > This is a great discussion, and appreciate the interest in my problem.
> > > I think there are workarounds if you decide not to store state, but
> > > I'd recommend keeping it. I think state should be kept optionally,
> > > even turned off by default. Several times I've had issues where the
> > > state has caused me to miss files, because files get moved into the
> > > source folder out of order, and I've wished I could turn the state
> > > feature off.
> > >
> > > In my current use-case, I would not be frequently, dynamically
> > > changing the source directory, though I can see the use-cases where it
> > > would be. In my current use-case, I want to use an external database
> > > table to control the configuration of all my flows. I do this by first
> > > reading the content of the table for this particular flow ID, then
> > > assign the result as attributes to the flowfile, essentially creating
> > > variables I can use throughout the flow to control its behavior. This
> > > works great with flows that initiate with HTTP or SQL, but not
> > > ListSFTP or ListFile.
> > >
> > > Scott
> > >
> > >
> > > On 03/27/2018 02:05 PM, Andy LoPresto wrote:
> > >
> > > I think Bryan’s point is a good one and when I first saw this
> > > question (and thought of the previous times it’s been asked), my
> > > initial response is to propose a second processor.
> > >
> > > Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
> > > differently from ListSFTP — it does not maintain state, and performs
> > > a one-time tabulation/chronicling of the state of that directory at
> > > the given point in time.
> > >
> > > The responsibility to maintain and compare state across time is no
> > > longer a requirement. There could even be a setting in the processor
> > > to allow for “individual flowfile output” (i.e. act the same as
> > > ListSFTP and output one flowfile per item listed) or “summary
> > > flowfile output” where a single flowfile is generated containing the
> > > directory listing information for all the items there. (Another
> > > option is to output both on two different relationships).
> > >
> > > I think this would enable the types of workflows that users have
> > > asked about in the past without compromising the mechanism by which
> > > List* processors work and adding undue complexity to those processors.
> > >
> > > Absolutely crystal clear documentation (and a standard verb for the
> > > new processor family) would be necessary (not only because these
> > > processors solve different problems, but to avoid a million variants
> > > of “I used ScanSFTP processor and it’s not tracking state”/“How do I
> > > provide a directory in an attribute to ListSFTP” mailing list
> > > questions).
> > >
> > >
> > > Andy LoPresto
> > > alopresto@apache.org <ma...@apache.org>
> > > /alopresto.apache@gmail.com <ma...@gmail.com>/
> > > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> > >
> > > On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com
> > > <ma...@gmail.com>> wrote:
> > >
> > > The key here is that ListXXX processor maintains state. A directory
> > > is part of such state. Allowing arbitrary directories via an
> > > expression would create never ending stream of new entries in the
> > > state storage, effectively engineering a distributed DoS attack on
> > > the NiFi node or shared ZK quorum (for when state is stored in
> > > there).
> > >
> > > Maybe if we focus on thinking about assumptions and restrictions the
> > > processor should make to contain that risk...
> > >
> > > Andrew
> > >
> > > On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com
> > > <ma...@gmail.com>> wrote:
> > >
> > > I'm not sure that would solve the problem because you'd still be
> > > limited to one directory. What most people are asking for is the
> > > ability to use a dynamic directory from an incoming flow file.
> > >
> > > I think we might be trying to fit two different use-cases into one
> > > processor which might not make sense.
> > >
> > > Scenario #1... There is a directory that is constantly receiving new
> > > data and has a significant amount of files, and I want to
> > > periodically find new files. This is what the current processors
> > > are optimized for.
> > >
> > > Scenario #2... There is a directory that is mostly static with a
> > > moderate/small number of files, and at points in my flow I want to
> > > dynamically perform a listing of this directory and retrieve the
> > > files. This is more geared towards the mentality of running a
> > > job/workflow.
> > >
> > >
> > >
> > >
> > > On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler
> > > <ottobackwards@gmail.com <ma...@gmail.com>>
> > > wrote:
> > >
> > > What if the changes were ‘on top of’ some base set of properties,
> > > like directory?
> > > Like a filter, where if present from the incoming file will have
> > > the LIST* list only things that match a name or attribute?
> > >
> > >
> > >
> > > On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com
> > > <ma...@gmail.com>) wrote:
> > >
> > > Scott
> > >
> > > This idea has come up a couple of times and there is definitely
> > > something intriguing to it. Where I think this idea stalls out
> > > though is in implementation.
> > >
> > > While I agree that the other List* processors might similarly
> > > benefit, let's focus on ListFile. Today you tell ListFile what
> > > directory to start looking for files in. It goes off scanning that
> > > directory for hits and stores state about what it has already
> > > searched/seen. And it is important to keep track of how much it
> > > has already scanned because at times the search directory can be
> > > massive (hundreds of thousands or more files and directories to
> > > scan, for example).
> > >
> > > In the proposed model the directory to be scanned could be provided
> > > dynamically by looking at an attribute of an incoming flowfile (or
> > > other criteria can be provided - not just the directory to scan).
> > > In this case the ListFile processor goes on scanning against that
> > > now. What about the previous directory (or directories) it was told
> > > to scan? Does it still track those too? What if it starts scanning
> > > the newly provided directory, hasn't finished pulling all the data
> > > or new data is continually arriving, and it is told to switch to
> > > another directory?
> > >
> > > I think if those questions can get solid answers and someone
> > > invests time in creating a PR then this could be pretty powerful.
> > > Would be good to see a written description of the use case(s) for
> > > this too.
> > >
> > > Thanks
> > > Joe
> > >
> > > On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com
> > > <ma...@gmail.com>> wrote:
> > >
> > > Hello Devs,
> > >
> > > I would like to request a feature to a major processor, ListSFTP.
> > > But before I go down the official road, I wanted to ask if anyone
> > > thought it was a terrible idea or impossible, etc. The request is
> > > to add support for an incoming relationship to the ListSFTP
> > > processor specifically, but I could see it added to many of the
> > > commonly used head processes, such as ListFile. I would envision
> > > functionality more like InvokeHTTP or ExecuteSQL, where an
> > > incoming flow file could initiate the action, and the attributes
> > > in the incoming flow file could be used to configure the processor
> > > actions. It's the configuration aspect that most appeals to me,
> > > because it opens it up to being centrally or dynamically
> > > configured.
> > >
> > > Thanks,
> > >
> > > Scott
> > >
> > >
> > >
> > >
> >
>

Re: ListSFTP incoming relationship

Posted by scott <tc...@gmail.com>.

Okay. I guess I didn't realize how NiFi dev felt about risk tolerance. I
think I can work around it by adding duplicate filtering or implementing some
other state management solution.
So, what's the next step?

Scott
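
A rough sketch of what the duplicate-filtering workaround could look like, written in Python rather than as actual NiFi code. The idea is that a stateless listing is taken on every trigger and a downstream step drops the entries it has already passed along. The `Listing` tuple, the `DuplicateFilter` class, and the choice of key (path, modified time, size) are all assumptions made for illustration, not part of any existing processor.

```python
from collections import namedtuple

# Hypothetical shape of one listing entry; the field names are illustrative.
Listing = namedtuple("Listing", ["path", "modified", "size"])


class DuplicateFilter:
    """Remembers which listing entries were already emitted."""

    def __init__(self):
        self._seen = set()

    def new_entries(self, listing):
        """Return entries not emitted before, in the order they were listed."""
        fresh = []
        for entry in listing:
            # A changed modified time or size makes the file count as new again.
            key = (entry.path, entry.modified, entry.size)
            if key not in self._seen:
                self._seen.add(key)
                fresh.append(entry)
        return fresh
```

Because the filter keys on each file rather than on a "newest timestamp seen" watermark, files that land in the source folder out of order are still picked up. The trade-off is that the seen-set grows with every distinct file, which is the same unbounded-state concern raised earlier in the thread.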

On Thu, Mar 29, 2018, 10:46 AM Bryan Bende <bb...@gmail.com> wrote:

> Scott,
>
> You are correct that the overall discussion is about allowing incoming
> flow files to ListSFTP.
>
> However, the previous discussion on this thread highlighted that the
> main reason ListSFTP currently doesn't allow incoming flow files is
> because of challenges when storing state.
>
> This led to the proposal of a new processor that allowed incoming flow
> files, but did not store state, thus avoiding the challenges mentioned
> above. If we were going to store state in this new processor, then
> we'd be back to the exact same challenges.
>
> Providing an option to turn on state also doesn't really help, because
> if there is an option provided to users, then the option will be used,
> and it needs to work when it is used.
>
> If we can come up with something that stores state and works well for
> all scenarios, then we aren't against it, we just need to handle the
> challenges highlighted by Joe's original email.
>
> Regarding some of the other ideas...
>
> The current output of ListSFTP already includes flow file attributes
> for each listing that include the full path, filename, last update
> time, owner, group, permissions, and file size.... were you thinking
> of something different than that?
>
> See the "Writes Attributes" section here:
>
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html
>
> Thanks,
>
> Bryan
>
>
>
> On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <al...@apache.org>
> wrote:
> > Scott,
> >
> > I think there are two conversations going on here. You are finding the
> > requirements for your specific use case, and that’s great. But I echo
> > Bryan’s point that a community processor for this scenario should not
> > store state at all. Sivaprasanna’s point that given dynamic directory
> > input, storing state based on that can cause massive data ingestion
> > problems still stands.
> >
> > For your specific use case, you can prototype (or possibly even get to a
> > stable and robust-enough point) using ExecuteScript to model the behavior
> > you need.
> >
> > In regards to the desired output format, I would suggest a few items:
> >
> > * Avro requires a schema to be defined, and this raises the barrier to
> > use of the processor. Also, unless being sent to a processor that
> > understands Avro, the result will need to be converted anyway using
> > Record* processors.
> > * If the output is individual flowfiles on a 1:1 basis, each should have
> > as many attributes populated with the parsed information as possible
> > (i.e. file.name, file.path, file.size, file.owner, file.permissions,
> > etc.). This allows for easily-consumable and routable flowfiles.
> > * If the output is a full directory listing, I would suggest `ls -al`
> > type raw text output, or JSON (arbitrary human-readable and
> > machine-readable format with many consuming/transforming processors).
> >
> >
> > Andy LoPresto
> > alopresto@apache.org
> > alopresto.apache@gmail.com
> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> >
> > On Mar 29, 2018, at 9:34 AM, scott <tc...@gmail.com> wrote:
> >
> > Sorry Bryan, but I disagree with you. Not storing state is NOT the main
> > point of this new processor. The main point is to allow an incoming
> > relationship flowfile to trigger the action, and allow variables to be
> > used from the attributes therein.
> >
> > I agree that if the NiFi community deems it too risky to distribute this
> > processor with state keeping optionally available, even if the default
> > is to disable it, then so be it. If state is not included optionally,
> > then how about making the output flowfile content include more than just
> > the file names? Have it include last updated time along with the
> > filename. If it searches recursively, you'll want to include the path to
> > the file also. Maybe it would be best to output the results into a
> > structured format, such as AVRO? Or, maybe it would just be best to
> > output one flowfile per remote file found, and include updated time and
> > fully qualified path as attributes?
> >
> > Scott
> >
> >
> > On 03/29/2018 04:32 AM, Bryan Bende wrote:
> >
> > The main point of the new processor is to NOT store state so that it
> > becomes more reasonable to allow incoming flow files.
> >
> > You could probably implement your own custom processor that does both
> > because you can make assumptions about how you are going to use it, but
> > if the NiFi community provides one then it needs to work well for all
> > situations, such as dynamically listing hundreds of directories, which is
> > problematic when state is involved.
> >
> > On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <si...@gmail.com>
> > wrote:
> >
> > Should we really have to have an optional state saving functionality? If
> > the user is unaware of the implications and proceeds to store the state
> > then what Andrew Grande mentioned will happen - possibilities of a never
> > ending stream of state information being stored. If we still go with the
> > optional state management approach, documentation has to be clear in
> > explaining the implications.
> >
> > Sivaprasanna
> >
> > On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:
> >
> > Okay. So, a new processor called "ScanSFTP", allow incoming relationship
> > where the content of the flow file is replaced with the list of matching
> > files from the remote directory, then the list is filtered by the usual
> > regex parameters like today. Optional state information is kept to
> > additionally filter the list of files older than the newest file
> > observed during the last run. Does that sound okay to everyone? If so,
> > what's the next step?
> >
> > Scott
> >
> >
> > On 03/27/2018 06:21 PM, scott wrote:
> >
> > This is a great discussion, and appreciate the interest in my problem.
> > I think there are workarounds if you decide not to store state, but
> > I'd recommend keeping it. I think state should be kept optionally,
> > even turned off by default. Several times I've had issues where the
> > state has caused me to miss files, because files get moved into the
> > source folder out of order, and I've wished I could turn the state
> > feature off.
> >
> > In my current use-case, I would not be frequently, dynamically
> > changing the source directory, though I can see the use-cases where it
> > would be. In my current use-case, I want to use an external database
> > table to control the configuration of all my flows. I do this by first
> > reading the content of the table for this particular flow ID, then
> > assign the result as attributes to the flowfile, essentially creating
> > variables I can use throughout the flow to control its behavior. This
> > works great with flows that initiate with HTTP or SQL, but not
> > ListSFTP or ListFile.
> >
> > Scott
> >
> >
> > On 03/27/2018 02:05 PM, Andy LoPresto wrote:
> >
> > I think Bryan’s point is a good one and when I first saw this
> > question (and thought of the previous times it’s been asked), my
> > initial response is to propose a second processor.
> >
> > Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
> > differently from ListSFTP — it does not maintain state, and performs
> > a one-time tabulation/chronicling of the state of that directory at
> > the given point in time.
> >
> > The responsibility to maintain and compare state across time is no
> > longer a requirement. There could even be a setting in the processor
> > to allow for “individual flowfile output” (i.e. act the same as
> > ListSFTP and output one flowfile per item listed) or “summary
> > flowfile output” where a single flowfile is generated containing the
> > directory listing information for all the items there. (Another
> > option is to output both on two different relationships).
> >
> > I think this would enable the types of workflows that users have
> > asked about in the past without compromising the mechanism by which
> > List* processors work and adding undue complexity to those processors.
> >
> > Absolutely crystal clear documentation (and a standard verb for the
> > new processor family) would be necessary (not only because these
> > processors solve different problems, but to avoid a million variants
> > of “I used ScanSFTP processor and it’s not tracking state”/“How do I
> > provide a directory in an attribute to ListSFTP” mailing list
> > questions).
> >
> >
> > Andy LoPresto
> > alopresto@apache.org <ma...@apache.org>
> > /alopresto.apache@gmail.com <ma...@gmail.com>/
> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> >
> > On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com
> > <ma...@gmail.com>> wrote:
> >
> > The key here is that ListXXX processor maintains state. A directory
> > is part of such state. Allowing arbitrary directories via an
> > expression would create never ending stream of new entries in the
> > state storage, effectively engineering a distributed DoS attack on
> > the NiFi node or shared ZK quorum (for when state is stored in
> > there).
> >
> > Maybe if we focus on thinking about assumptions and restrictions the
> > processor should make to contain that risk...
> >
> > Andrew
> >
> > On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com
> > <ma...@gmail.com>> wrote:
> >
> > I'm not sure that would solve the problem because you'd still be
> > limited to one directory. What most people are asking for is the
> > ability to use a dynamic directory from an incoming flow file.
> >
> > I think we might be trying to fit two different use-cases into one
> > processor which might not make sense.
> >
> > Scenario #1... There is a directory that is constantly receiving new
> > data and has a significant amount of files, and I want to
> > periodically find new files. This is what the current processors
> > are optimized for.
> >
> > Scenario #2... There is a directory that is mostly static with a
> > moderate/small number of files, and at points in my flow I want to
> > dynamically perform a listing of this directory and retrieve the
> > files. This is more geared towards the mentality of running a
> > job/workflow.
> >
> >
> >
> >
> > On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler
> > <ottobackwards@gmail.com <ma...@gmail.com>>
> > wrote:
> >
> > What if the changes were ‘on top of’ some base set of properties,
> > like directory?
> > Like a filter, where if present from the incoming file will have
> > the LIST* list only things that match a name or attribute?
> >
> >
> >
> > On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com
> > <ma...@gmail.com>) wrote:
> >
> > Scott
> >
> > This idea has come up a couple of times and there is definitely
> > something intriguing to it. Where I think this idea stalls out
> > though is in implementation.
> >
> > While I agree that the other List* processors might similarly
> > benefit, let's focus on ListFile. Today you tell ListFile what
> > directory to start looking for files in. It goes off scanning that
> > directory for hits and stores state about what it has already
> > searched/seen. And it is important to keep track of how much it
> > has already scanned because at times the search directory can be
> > massive (hundreds of thousands or more files and directories to
> > scan, for example).
> >
> > In the proposed model the directory to be scanned could be provided
> > dynamically by looking at an attribute of an incoming flowfile (or
> > other criteria can be provided - not just the directory to scan).
> > In this case the ListFile processor goes on scanning against that
> > now. What about the previous directory (or directories) it was told
> > to scan? Does it still track those too? What if it starts scanning
> > the newly provided directory, hasn't finished pulling all the data
> > or new data is continually arriving, and it is told to switch to
> > another directory?
> >
> > I think if those questions can get solid answers and someone
> > invests time in creating a PR then this could be pretty powerful.
> > Would be good to see a written description of the use case(s) for
> > this too.
> >
> > Thanks
> > Joe
> >
> > On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com
> > <ma...@gmail.com>> wrote:
> >
> > Hello Devs,
> >
> > I would like to request a feature to a major processor, ListSFTP.
> > But before I go down the official road, I wanted to ask if anyone
> > thought it was a terrible idea or impossible, etc. The request is
> > to add support for an incoming relationship to the ListSFTP
> > processor specifically, but I could see it added to many of the
> > commonly used head processes, such as ListFile. I would envision
> > functionality more like InvokeHTTP or ExecuteSQL, where an
> > incoming flow file could initiate the action, and the attributes
> > in the incoming flow file could be used to configure the processor
> > actions. It's the configuration aspect that most appeals to me,
> > because it opens it up to being centrally or dynamically
> > configured.
> >
> > Thanks,
> >
> > Scott
> >
> >
> >
> >
>

Re: ListSFTP incoming relationship

Posted by Andy LoPresto <al...@gmail.com>.

You would still need to store the state per-directory-scanned, and this would scale with the number of directories used — this raises resolution questions as well, like does “~” == “/usr/xyz/home”, “/Users/xyz”, etc.? Is “~” the user NiFi is running as? 

So eventually you will end up with a map of some kind using resolved or unresolved directories as the keys and some state indicator (timestamp or otherwise) as the value. How long do you wait to age these values out? What if it scales to the hundreds of thousands of different key entries? The incoming attribute can have unbounded range, so there is no guarantee on the upper limit. 

I think the “minimum value” idea scales for a single directory listing, but not on the orthogonal axis for many possible directory values. 
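
To make the scaling concern concrete, here is a minimal Python sketch of such a map. It is not NiFi code: `DirectoryStateMap`, `max_entries`, and `max_age_seconds` are hypothetical names, and the eviction policy (age cutoff plus a hard cap that drops the least recently updated key) is only one assumed answer to the "how long do you wait" question. Keys are normalized with `posixpath.normpath`, which collapses `.` segments and duplicate slashes but does not resolve `~`, so the resolution ambiguity stays visible.

```python
import posixpath
from collections import OrderedDict


class DirectoryStateMap:
    """Hypothetical per-directory listing state: directory -> newest timestamp seen."""

    def __init__(self, max_entries=1000, max_age_seconds=3600.0):
        self.max_entries = max_entries
        self.max_age_seconds = max_age_seconds
        self._entries = OrderedDict()  # key -> (newest_file_ts, last_update_ts)

    @staticmethod
    def normalize(directory):
        # Collapses "a//b/./c" style variants; does NOT resolve "~" or symlinks,
        # so "~" and "/home/xyz" remain two distinct keys.
        return posixpath.normpath(directory)

    def update(self, directory, newest_file_ts, now):
        key = self.normalize(directory)
        # Pop and re-insert so the key moves to the "most recently updated" end.
        previous = self._entries.pop(key, (0.0, now))[0]
        self._entries[key] = (max(previous, newest_file_ts), now)
        self._evict(now)

    def newest_seen(self, directory):
        entry = self._entries.get(self.normalize(directory))
        return entry[0] if entry else None

    def _evict(self, now):
        # Age out stale keys, then enforce the hard cap (oldest update first).
        stale = [k for k, (_, updated) in self._entries.items()
                 if now - updated > self.max_age_seconds]
        for key in stale:
            del self._entries[key]
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)
```

With a cap in place the map stays bounded, but any directory whose key was evicted looks brand new on its next listing and would be re-emitted in full. Bounding the state trades memory for duplicates, which is why the per-directory axis is the hard part.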

Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Mar 29, 2018, at 13:15, Charlie Meyer <ch...@civitaslearning.com> wrote:
> 
> Just a thought,
> 
> 
> Could a processor that did the scan and stored state be implemented similar
> to GenerateTableFetch, where there is a minimum value attribute that is
> specified that could be read from the source (such as created date, updated
> date, etc)? That way the state could potentially be manageable.
> 
>> On Thu, Mar 29, 2018 at 2:43 PM, Andy LoPresto <al...@apache.org> wrote:
>> 
>> Bryan,
>> 
>> No, that was exactly what I was referencing regarding the attribute
>> output. It would have been clearer if I had said it like you did. Thanks.
>> 
>> Andy LoPresto
>> alopresto@apache.org
>> *alopresto.apache@gmail.com <al...@gmail.com>*
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> 
>> On Mar 29, 2018, at 10:46 AM, Bryan Bende <bb...@gmail.com> wrote:
>> 
>> Scott,
>> 
>> You are correct that the overall discussion is about allowing incoming
>> flow files to ListSFTP.
>> 
>> However, the previous discussion on this thread highlighted that the
>> main reason ListSFTP currently doesn't allow incoming flow files is
>> because of challenges when storing state.
>> 
>> This led to the proposal of a new processor that allowed incoming flow
>> files, but did not store state, thus avoiding the challenges mentioned
>> above. If we were going to store state in this new processor, then
>> we'd be back to the exact same challenges.
>> 
>> Providing an option to turn on state also doesn't really help, because
>> if there is an option provided to users, then the option will be used,
>> and it needs to work when it is used.
>> 
>> If we can come up with something that stores state and works well for
>> all scenarios, then we aren't against it, we just need to handle the
>> challenges highlighted by Joe's original email.
>> 
>> Regarding some of the other ideas...
>> 
>> The current output of ListSFTP already includes flow file attributes
>> for each listing that include the full path, filename, last update
>> time, owner, group, permissions, and file size.... were you thinking
>> of something different than that?
>> 
>> See the "Writes Attributes" section here:
>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html
>> 
>> Thanks,
>> 
>> Bryan
>> 
>> 
>> 
>> On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <al...@apache.org>
>> wrote:
>> 
>> Scott,
>> 
>> I think there are two conversations going on here. You are finding the
>> requirements for your specific use case, and that’s great. But I echo
>> Bryan’s point that a community processor for this scenario should not store
>> state at all. Sivaprasanna’s point that given dynamic directory input,
>> storing state based on that can cause massive data ingestion problems still
>> stands.
>> 
>> For your specific use case, you can prototype (or possibly even get to a
>> stable and robust-enough point) using ExecuteScript to model the behavior
>> you need.
>> 
>> In regards to the desired output format, I would suggest a few items:
>> 
>> * Avro requires a schema to be defined, and this raises the barrier to use
>> of the processor. Also, unless being sent to a processor that understands
>> Avro, the result will need to be converted anyway using Record* processors.
>> * If the output is individual flowfiles on a 1:1 basis, each should have as
>> many attributes populated with the parsed information as possible (i.e.
>> file.name, file.path, file.size, file.owner, file.permissions, etc.). This
>> allows for easily-consumable and routable flowfiles.
>> * If the output is a full directory listing, I would suggest `ls -al` type
>> raw text output, or JSON (arbitrary human-readable and machine-readable
>> format with many consuming/transforming processors).
>> 
>> 
>> Andy LoPresto
>> alopresto@apache.org
>> alopresto.apache@gmail.com
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> 
>> On Mar 29, 2018, at 9:34 AM, scott <tc...@gmail.com> wrote:
>> 
>> Sorry Bryan, but I disagree with you. Not storing state is NOT the main
>> point of this new processor. The main point is to allow an incoming
>> relationship flowfile to trigger the action, and allow variables to be used
>> from the attributes therein.
>> 
>> I agree that if the NiFi community deems it too risky to distribute this
>> processor with state keeping optionally available, even if the default is to
>> disable it, then so be it. If state is not included optionally, then how
>> about making the output flowfile content include more than just the file
>> names? Have it include last updated time along with the filename. If it
>> searches recursively, you'll want to include the path to the file also.
>> Maybe it would be best to output the results into a structured format, such
>> as AVRO? Or, maybe it would just be best to output one flowfile per remote
>> file found, and include updated time and fully qualified path as attributes?
>> 
>> Scott
>> 
>> 
>> On 03/29/2018 04:32 AM, Bryan Bende wrote:
>> 
>> The main point of the new processor is to NOT store state so that it
>> becomes more reasonable to allow incoming flow files.
>> 
>> You could probably implement your own custom processor that does both
>> because you can make assumptions about how you are going to use it, but if
>> the NiFi community provides one then it needs to work well for all
>> situations, such as dynamically listing hundreds of directories, which is
>> problematic when state is involved.
>> 
>> On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <si...@gmail.com>
>> wrote:
>> 
>> Should we really have to have an optional state saving functionality? If
>> the user is unaware of the implications and proceeds to store the state then
>> what Andrew Grande mentioned will happen - the possibility of a never-ending
>> stream of state information being stored. If we still go with the optional
>> state management approach, the documentation has to be clear in explaining
>> the implications.
>> 
>> Sivaprasanna
>> 
>> On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:
>> 
>> Okay. So, a new processor called "ScanSFTP", allow incoming relationship
>> where the content of the flow file is replaced with the list of matching
>> files from the remote directory, then the list is filtered by the usual
>> regex parameters like today. Optional state information is kept to
>> additionally filter the list of files older than the newest file
>> observed during the last run. Does that sound okay to everyone? If so,
>> what's the next step?
>> 
>> Scott
>> 
>> 
>> On 03/27/2018 06:21 PM, scott wrote:
>> 
>> This is a great discussion, and appreciate the interest in my problem.
>> I think there are workarounds if you decide not to store state, but
>> I'd recommend keeping it. I think state should be kept optionally,
>> even turned off by default. Several times I've had issues where the
>> state has caused me to miss files, because files get moved into the
>> source folder out of order, and I've wished I could turn the state
>> feature off.
>> 
>> In my current use-case, I would not be frequently, dynamically
>> changing the source directory, though I can see the use-cases where it
>> would be. In my current use-case, I want to use an external database
>> table to control the configuration of all my flows. I do this by first
>> reading the content of the table for this particular flow ID, then
>> assign the result as attributes to the flowfile, essentially creating
>> variables I can use throughout the flow to control its behavior. This
>> works great with flows that initiate with HTTP or SQL, but not
>> ListSFTP or ListFile.
>> 
>> Scott
>> 
>> 
>> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>> 
>> I think Bryan’s point is a good one and when I first saw this
>> question (and thought of the previous times it’s been asked), my
>> initial response is to propose a second processor.
>> 
>> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
>> differently from ListSFTP — it does not maintain state, and performs
>> a one-time tabulation/chronicling of the state of that directory at
>> the given point in time.
>> 
>> The responsibility to maintain and compare state across time is no
>> longer a requirement. There could even be a setting in the processor
>> to allow for “individual flowfile output” (i.e. act the same as
>> ListSFTP and output one flowfile per item listed) or “summary
>> flowfile output” where a single flowfile is generated containing the
>> directory listing information for all the items there. (Another
>> option is to output both on two different relationships).
>> 
>> I think this would enable the types of workflows that users have
>> asked about in the past without compromising the mechanism by which
>> List* processors work and adding undue complexity to those processors.
>> 
>> Absolutely crystal clear documentation (and a standard verb for the
>> new processor family) would be necessary (not only because these
>> processor solve different problems, but to avoid a million variants
>> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
>> provide a directory in an attribute to ListSFTP” mailing list
>> questions).
>> 
>> 
>> Andy LoPresto
>> alopresto@apache.org <ma...@apache.org>
>> /alopresto.apache@gmail.com <ma...@gmail.com>/
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> 
>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com
>> <ma...@gmail.com>> wrote:
>> 
>> The key here is that the ListXXX processor maintains state. A directory is
>> part of such state. Allowing arbitrary directories via an expression would
>> create a never-ending stream of new entries in the state storage, effectively
>> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
>> (for when state is stored in there).
>> 
>> Maybe if we focus on thinking about assumptions and restrictions the
>> processor should make to contain that risk...
>> 
>> Andrew
>> 
>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com
>> <ma...@gmail.com>> wrote:
>> 
>> I'm not sure that would solve the problem because you'd still be
>> limited to one directory. What most people are asking for is the
>> ability to use a dynamic directory from an incoming flow file.
>> 
>> I think we might be trying to fit two different use-cases into one
>> processor which might not make sense.
>> 
>> Scenario #1... There is a directory that is constantly receiving new
>> data and has a significant number of files, and I want to periodically
>> find new files. This is what the current processors are optimized for.
>> 
>> Scenario #2... There is a directory that is mostly static with a
>> moderate/small number of files, and at points in my flow I want to
>> dynamically perform a listing of this directory and retrieve the
>> files. This is more geared towards the mentality of running a
>> job/workflow.
>> 
>> 
>> 
>> 
>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler
>> <ottobackwards@gmail.com <ma...@gmail.com>>
>> wrote:
>> 
>> What if the changes were ‘on top of’ some base set of properties, like
>> directory? Like a filter which, if present in the incoming file, will have
>> the LIST* list only things that match a name or attribute?
>> 
>> 
>> 
>> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com
>> <ma...@gmail.com>) wrote:
>> 
>> Scott
>> 
>> This idea has come up a couple of times and there is definitely
>> something intriguing to it. Where I think this idea stalls out though
>> is in implementation.
>>
>> While I agree that the other List* processors might similarly benefit,
>> let's focus on ListFile. Today you tell ListFile what directory to
>> start looking for files in. It goes off scanning that directory for
>> hits and stores state about what it has already searched/seen. And it
>> is important to keep track of how much it has already scanned because
>> at times the search directory can be massive (hundreds of thousands or
>> more files and directories to scan, for example).
>>
>> In the proposed model the directory to be scanned could be provided
>> dynamically by looking at an attribute of an incoming flowfile (or
>> other criteria can be provided - not just the directory to scan). In
>> this case the ListFile processor goes on scanning against that now.
>> What about the previous directory (or directories) it was told to
>> scan? Does it still track those too? What if it starts scanning the
>> newly provided directory, hasn't finished pulling all the data or new
>> data is continually arriving, and it is told to switch to another
>> directory?
>>
>> I think if those questions can get solid answers and someone invests
>> time in creating a PR then this could be pretty powerful. Would be
>> good to see a written description of the use case(s) for this too.
>> 
>> Thanks
>> Joe
>> 
>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com
>> <ma...@gmail.com>> wrote:
>> 
>> Hello Devs,
>> 
>> I would like to request a feature to a major processor, ListSFTP. But before
>> I go down the official road, I wanted to ask if anyone thought it was a
>> terrible idea or impossible, etc. The request is to add support for an
>> incoming relationship to the ListSFTP processor specifically, but I could
>> see it added to many of the commonly used head processes, such as ListFile.
>> I would envision functionality more like InvokeHTTP or ExecuteSQL, where an
>> incoming flow file could initiate the action, and the attributes in the
>> incoming flow file could be used to configure the processor actions. It's
>> the configuration aspect that most appeals to me, because it opens it up to
>> being centrally or dynamically configured.
>> 
>> Thanks,
>> 
>> Scott
>> 
>> 
>> 
>> 
>> 
>>

Re: ListSFTP incoming relationship

Posted by Charlie Meyer <ch...@civitaslearning.com>.

Just a thought,


Could a processor that did the scan and stored state be implemented similar
to GenerateTableFetch, where there is a minimum value attribute that is
specified that could be read from the source (such as created date, updated
date, etc)? That way the state could potentially be manageable.
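
One possible reading of this suggestion, sketched in Python rather than as NiFi code: the listing step keeps no state of its own, and the minimum value travels with the request (for example, as a flow file attribute). The names `RemoteFile` and `list_newer_than` are hypothetical, and using the last-modified time as the compared value is an assumption.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class RemoteFile(NamedTuple):
    path: str
    modified: int  # last-modified time, e.g. epoch milliseconds


def list_newer_than(
    entries: Iterable[RemoteFile], minimum_modified: Optional[int]
) -> Tuple[List[RemoteFile], Optional[int]]:
    """Return entries strictly newer than the minimum, plus the next minimum."""
    selected = sorted(
        (e for e in entries if minimum_modified is None or e.modified > minimum_modified),
        key=lambda e: (e.modified, e.path),
    )
    # If nothing new was found, the caller keeps using the old minimum.
    next_minimum = selected[-1].modified if selected else minimum_modified
    return selected, next_minimum
```

The caller decides where to keep `next_minimum` (a database row, a cache entry), so the amount of state is bounded by what the caller chooses to track. A strict "newer than" comparison still misses files that arrive later with an older or equal timestamp, which is the out-of-order problem mentioned earlier in the thread.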

On Thu, Mar 29, 2018 at 2:43 PM, Andy LoPresto <al...@apache.org> wrote:

> Bryan,
>
> No, that was exactly what I was referencing regarding the attribute
> output. It would have been clearer if I had said it like you did. Thanks.
>
> Andy LoPresto
> alopresto@apache.org
> *alopresto.apache@gmail.com <al...@gmail.com>*
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Mar 29, 2018, at 10:46 AM, Bryan Bende <bb...@gmail.com> wrote:
>
> Scott,
>
> You are correct that the overall discussion is about allowing incoming
> flow files to ListSFTP.
>
> However, the previous discussion on this thread highlighted that the
> main reason ListSFTP currently doesn't allow incoming flow files is
> because of challenges when storing state.
>
> This led to the proposal of a new processor that allowed incoming flow
> files, but did not store state, thus avoiding the challenges mentioned
> above. If we were going to store state in this new processor, then
> we'd be back to the exact same challenges.
>
> Providing an option to turn on state also doesn't really help, because
> if there is an option provided to users, then the option will be used,
> and it needs to work when it is used.
>
> If we can come up with something that stores state and works well for
> all scenarios, then we aren't against it, we just need to handle the
> challenges highlighted by Joe's original email.
>
> Regarding some of the other ideas...
>
> The current output of ListSFTP already includes flow file attributes
> for each listing that include the full path, filename, last update
> time, owner, group, permissions, and file size.... were you thinking
> of something different than that?
>
> See the "Writes Attributes" section here:
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html
>
> Thanks,
>
> Bryan
>
>
>
> On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <al...@apache.org>
> wrote:
>
> Scott,
>
> I think there are two conversations going on here. You are finding the
> requirements for your specific use case, and that’s great. But I echo
> Bryan’s point that a community processor for this scenario should not store
> state at all. Sivaprasanna’s point that given dynamic directory input,
> storing state based on that can cause massive data ingestion problems still
> stands.
>
> For your specific use case, you can prototype (or possibly even get to a
> stable and robust-enough point) using ExecuteScript to model the behavior
> you need.
>
> In regards to the desired output format, I would suggest a few items:
>
> * Avro requires a schema to be defined, and this raises the barrier to use
> of the processor. Also, unless being sent to a processor that understands
> Avro, the result will need to be converted anyway using Record* processors.
> * If the output is individual flowfiles on a 1:1 basis, each should have as
> many attributes populated with the parsed information as possible (i.e.
> file.name, file.path, file.size, file.owner, file.permissions, etc.). This
> allows for easily-consumable and routable flowfiles.
> * If the output is a full directory listing, I would suggest `ls -al` type
> raw text output, or JSON (arbitrary human-readable and machine-readable
> format with many consuming/transforming processors).
>
>
> Andy LoPresto
> alopresto@apache.org
> alopresto.apache@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Mar 29, 2018, at 9:34 AM, scott <tc...@gmail.com> wrote:
>
> Sorry Bryan, but I disagree with you. Not storing state is NOT the main
> point of this new processor. The main point is to allow an incoming
> relationship flowfile to trigger the action, and allow variables to be used
> from the attributes therein.
>
> I agree that if the NiFi community deems it too risky to distribute this
> processor with state keeping optionally available, even if the default is to
> disable it, then so be it. If state is not included optionally, then how
> about making the output flowfile content include more than just the file
> names? Have it include last updated time along with the filename. If it
> searches recursively, you'll want to include the path to the file also.
> Maybe it would be best to output the results into a structured format, such
> as AVRO? Or, maybe it would just be best to output one flowfile per remote
> file found, and include updated time and fully qualified path as attributes?
>
> Scott
>
>
> On 03/29/2018 04:32 AM, Bryan Bende wrote:
>
> The main point of the new processor is to NOT store state so that it
> becomes more reasonable to allow incoming flow files.
>
> You could probably implement your own custom processor that does both
> because you can make assumptions about how you are going to use it, but if
> the NiFi community provides one then it needs to work well for all
> situations, such as dynamically listing hundreds of directories, which is
> problematic when state is involved.
>
> On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <si...@gmail.com>
> wrote:
>
> Should we really have to have an optional state saving functionality? If
> the user is unaware of the implications and proceeds to store the state then
> what Andrew Grande mentioned will happen - the possibility of a never-ending
> stream of state information being stored. If we still go with the optional
> state management approach, the documentation has to be clear in explaining
> the implications.
>
> Sivaprasanna
>
> On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:
>
> Okay. So, a new processor called "ScanSFTP", allow incoming relationship
> where the content of the flow file is replaced with the list of matching
> files from the remote directory, then the list is filtered by the usual
> regex parameters like today. Optional state information is kept to
> additionally filter the list of files older than the newest file
> observed during the last run. Does that sound okay to everyone? If so,
> what's the next step?
>
> Scott
>
>
> On 03/27/2018 06:21 PM, scott wrote:
>
> This is a great discussion, and appreciate the interest in my problem.
> I think there are workarounds if you decide not to store state, but
> I'd recommend keeping it. I think state should be kept optionally,
> even turned off by default. Several times I've had issues where the
> state has caused me to miss files, because files get moved into the
> source folder out of order, and I've wished I could turn the state
> feature off.
>
> In my current use-case, I would not be frequently, dynamically
> changing the source directory, though I can see the use-cases where it
> would be. In my current use-case, I want to use an external database
> table to control the configuration of all my flows. I do this by first
> reading the content of the table for this particular flow ID, then
> assign the result as attributes to the flowfile, essentially creating
> variables I can use throughout the flow to control its behavior. This
> works great with flows that initiate with HTTP or SQL, but not
> ListSFTP or ListFile.
>
> Scott
>
>
> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>
> I think Bryan’s point is a good one and when I first saw this
> question (and thought of the previous times it’s been asked), my
> initial response is to propose a second processor.
>
> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
> differently from ListSFTP — it does not maintain state, and performs
> a one-time tabulation/chronicling of the state of that directory at
> the given point in time.
>
> The responsibility to maintain and compare state across time is no
> longer a requirement. There could even be a setting in the processor
> to allow for “individual flowfile output” (i.e. act the same as
> ListSFTP and output one flowfile per item listed) or “summary
> flowfile output” where a single flowfile is generated containing the
> directory listing information for all the items there. (Another
> option is to output both on two different relationships).
>
> I think this would enable the types of workflows that users have
> asked about in the past without compromising the mechanism by which
> List* processors work and adding undue complexity to those processors.
>
> Absolutely crystal clear documentation (and a standard verb for the
> new processor family) would be necessary (not only because these
> processor solve different problems, but to avoid a million variants
> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
> provide a directory in an attribute to ListSFTP” mailing list
> questions).
>
>
> Andy LoPresto
> alopresto@apache.org <ma...@apache.org>
> /alopresto.apache@gmail.com <ma...@gmail.com>/
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com
> <ma...@gmail.com>> wrote:
>
> The key here is that the ListXXX processor maintains state. A directory is
> part of such state. Allowing arbitrary directories via an expression would
> create a never-ending stream of new entries in the state storage, effectively
> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
> (for when state is stored in there).
>
> Maybe if we focus on thinking about assumptions and restrictions the
> processor should make to contain that risk...
>
> Andrew
>
> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com
> <ma...@gmail.com>> wrote:
>
> I'm not sure that would solve the problem because you'd still be
> limited to one directory. What most people are asking for is the
> ability to use a dynamic directory from an incoming flow file.
>
> I think we might be trying to fit two different use-cases into one
> processor which might not make sense.
>
> Scenario #1... There is a directory that is constantly receiving new
> data and has a significant number of files, and I want to periodically
> find new files. This is what the current processors are optimized for.
>
> Scenario #2... There is a directory that is mostly static with a
> moderate/small number of files, and at points in my flow I want to
> dynamically perform a listing of this directory and retrieve the
> files. This is more geared towards the mentality of running a
> job/workflow.
>
>
>
>
> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler
> <ottobackwards@gmail.com <ma...@gmail.com>>
> wrote:
>
> What if the changes were ‘on top of’ some base set of properties, like
> directory? Like a filter which, if present in the incoming file, will have
> the LIST* list only things that match a name or attribute?
>
>
>
> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com
> <ma...@gmail.com>) wrote:
>
> Scott
>
> This idea has come up a couple of times and there is definitely
> something intriguing to it. Where I think this idea stalls out though
> is in implementation.
>
> While I agree that the other List* processors might similarly benefit,
> let's focus on ListFile. Today you tell ListFile what directory to
> start looking for files in. It goes off scanning that directory for
> hits and stores state about what it has already searched/seen. And it
> is important to keep track of how much it has already scanned because
> at times the search directory can be massive (hundreds of thousands or
> more files and directories to scan, for example).
>
> In the proposed model the directory to be scanned could be provided
> dynamically by looking at an attribute of an incoming flowfile (or
> other criteria can be provided - not just the directory to scan). In
> this case the ListFile processor goes on scanning against that now.
> What about the previous directory (or directories) it was told to
> scan? Does it still track those too? What if it starts scanning the
> newly provided directory, hasn't finished pulling all the data or new
> data is continually arriving, and it is told to switch to another
> directory?
>
> I think if those questions can get solid answers and someone invests
> time in creating a PR then this could be pretty powerful. Would be
> good to see a written description of the use case(s) for this too.
>
> Thanks
> Joe
>
> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com
> <ma...@gmail.com>> wrote:
>
> Hello Devs,
>
> I would like to request a feature to a major processor, ListSFTP. But before
> I go down the official road, I wanted to ask if anyone thought it was a
> terrible idea or impossible, etc. The request is to add support for an
> incoming relationship to the ListSFTP processor specifically, but I could
> see it added to many of the commonly used head processes, such as ListFile.
> I would envision functionality more like InvokeHTTP or ExecuteSQL, where an
> incoming flow file could initiate the action, and the attributes in the
> incoming flow file could be used to configure the processor actions. It's
> the configuration aspect that most appeals to me, because it opens it up to
> being centrally or dynamically configured.
>
> Thanks,
>
> Scott
>
>
>
>
>
>

Re: ListSFTP incoming relationship

Posted by Andy LoPresto <al...@apache.org>.

Bryan,

No, that was exactly what I was referencing regarding the attribute output. It would have been clearer if I had said it like you did. Thanks.

Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Mar 29, 2018, at 10:46 AM, Bryan Bende <bb...@gmail.com> wrote:
> 
> Scott,
> 
> You are correct that the overall discussion is about allowing incoming
> flow files to ListSFTP.
> 
> However, the previous discussion on this thread highlighted that the
> main reason ListSFTP currently doesn't allow incoming flow files is
> because of challenges when storing state.
> 
> This led to the proposal of a new processor that allowed incoming flow
> files, but did not store state, thus avoiding the challenges mentioned
> above. If we were going to store state in this new processor, then
> we'd be back to the exact same challenges.
> 
> Providing an option to turn on state also doesn't really help, because
> if there is an option provided to users, then the option will be used,
> and it needs to work when it is used.
> 
> If we can come up with something that stores state and works well for
> all scenarios, then we aren't against it, we just need to handle the
> challenges highlighted by Joe's original email.
> 
> Regarding some of the other ideas...
> 
> The current output of ListSFTP already includes flow file attributes
> for each listing that include the full path, filename, last update
> time, owner, group, permissions, and file size.... were you thinking
> of something different than that?
> 
> See the "Writes Attributes" section here:
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html
> 
> Thanks,
> 
> Bryan
> 
> 
> 
> On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <al...@apache.org> wrote:
>> Scott,
>> 
>> I think there are two conversations going on here. You are finding the
>> requirements for your specific use case, and that’s great. But I echo
>> Bryan’s point that a community processor for this scenario should not store
>> state at all. Sivaprasanna’s point that given dynamic directory input,
>> storing state based on that can cause massive data ingestion problems still
>> stands.
>> 
>> For your specific use case, you can prototype (or possibly even get to a
>> stable and robust-enough point) using ExecuteScript to model the behavior
>> you need.
>> 
>> In regards to the desired output format, I would suggest a few items:
>> 
>> * Avro requires a schema to be defined, and this raises the barrier to use
>> of the processor. Also, unless being sent to a processor that understands
>> Avro, the result will need to be converted anyway using Record* processors.
>> * If the output is individual flowfiles on a 1:1 basis, each should have as
>> many attributes populated with the parsed information as possible (i.e.
>> file.name, file.path, file.size, file.owner, file.permissions, etc.). This
>> allows for easily-consumable and routable flowfiles.
>> * If the output is a full directory listing, I would suggest `ls -al` type
>> raw text output, or JSON (arbitrary human-readable and machine-readable
>> format with many consuming/transforming processors).
>> 
>> 
>> Andy LoPresto
>> alopresto@apache.org
>> alopresto.apache@gmail.com
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> 
>> On Mar 29, 2018, at 9:34 AM, scott <tc...@gmail.com> wrote:
>> 
>> Sorry Bryan, but I disagree with you. Not storing state is NOT the main
>> point of this new processor. The main point is to allow an incoming
>> relationship flowfile to trigger the action, and allow variables to be used
>> from the attributes therein.
>> 
>> I agree that if the NiFi community deems it too risky to distribute this
>> processor with state keeping optionally available, even if the default is to
>> disable it, then so be it. If state is not included optionally, then how
>> about making the output flowfile content include more than just the file
>> names? Have it include last updated time along with the filename. If it
>> searches recursively, you'll want to include the path to the file also.
>> Maybe it would be best to output the results into a structured format, such
>> as AVRO? Or, maybe it would just be best to output one flowfile per remote
>> file found, and include updated time and fully qualified path as attributes?
>> 
>> Scott
>> 
>> 
>> On 03/29/2018 04:32 AM, Bryan Bende wrote:
>> 
>> The main point of the new processor is to NOT store state so that it
>> becomes more reasonable to allow incoming flow files.
>> 
>> You could probably implement your own custom processor that does both
>> because you can make assumptions about how you are going to use it, but if
>> the NiFi community provides one then it needs to work well for all
>> situations, such as dynamically listing hundreds of directories, which is
>> problematic when state is involved.
>> 
>> On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <si...@gmail.com>
>> wrote:
>> 
>> Should we really have to have an optional state saving functionality? If
>> the user is unaware of the implications and proceed to store the state then
>> what Andrew Grande mentioned will happen - possibilities of never ending
>> stream of state information being stored. If we still go with the optional
>> state management approach, documentation have to be clear in explaining the
>> implications.
>> 
>> Sivaprasanna
>> 
>> On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:
>> 
>> Okay. So, a new processor called "ScanSFTP", allow incoming relationship
>> where the content of the flow file is replaced with the list of matching
>> files from the remote directory, then the list is filtered by the usual
>> regex parameters like today. Optional state information is kept to
>> additionally filter the list of files older than the newest file
>> observed during the last run. Does that sound okay to everyone? If so,
>> what's the next step?
>> 
>> Scott
>> 
>> 
>> On 03/27/2018 06:21 PM, scott wrote:
>> 
>> This is a great discussion, and appreciate the interest in my problem.
>> I think there are workarounds if you decide not to store state, but
>> I'd recommend keeping it. I think state should be kept optionally,
>> even turned off by default. Several times I've had issues where the
>> state has cause me to miss files, because files get moved into the
>> source folder out of order, and I've wished I could turn the state
>> feature off.
>> 
>> In my current use-case, I would not be frequently, dynamically
>> changing the source directory, though I can see the use-cases where it
>> would be. In my current use-case, I want to use an external database
>> table to control the configuration of all my flows. I do this by first
>> reading the content of the table for this particular flow ID, then
>> assign the result as attributes to the flowfile, essentially creating
>> variables I can use throughout the flow to control its behavior. This
>> works great with flows that initiate with HTTP or SQL, but not
>> ListSFTP or ListFile.
>> 
>> Scott
>> 
>> 
>> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>> 
>> I think Bryan’s point is a good one and when I first saw this
>> question (and thought of the previous times it’s been asked), my
>> initial response is to propose a second processor.
>> 
>> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
>> differently from ListSFTP — it does not maintain state, and performs
>> a one-time tabulation/chronicling of the state of that directory at
>> the given point in time.
>> 
>> The responsibility to maintain and compare state across time is no
>> longer a requirement. There could even be a setting in the processor
>> to allow for “individual flowfile output” (i.e. act the same as
>> ListSFTP and output one flowfile per item listed) or “summary
>> flowfile output” where a single flowfile is generated containing the
>> directory listing information for all the items there. (Another
>> option is to output both on two different relationships).
>> 
>> I think this would enable the types of workflows that users have
>> asked about in the past without compromising the mechanism by which
>> List* processors work and adding undue complexity to those processors.
>> 
>> Absolutely crystal clear documentation (and a standard verb for the
>> new processor family) would be necessary (not only because these
>> processor solve different problems, but to avoid a million variants
>> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
>> provide a directory in an attribute to ListSFTP” mailing list
>> questions).
>> 
>> 
>> Andy LoPresto
>> alopresto@apache.org <ma...@apache.org>
>> /alopresto.apache@gmail.com <ma...@gmail.com>/
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> 
>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com
>> <ma...@gmail.com>> wrote:
>> 
>> The key here is that ListXXX processor maintains state. A directory is part
>> of such state. Allowing arbitrary directories via an expression would
>> create never ending stream of new entries in the state storage, effectively
>> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
>> (for when state is stored in there).
>> 
>> Maybe if we focus on thinking about assumptions and restrictions the
>> processor should make to contain that risk...
>> 
>> Andrew
>> 
>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com
>> <ma...@gmail.com>> wrote:
>> 
>> I'm not sure that would solve the problem because you'd still be
>> limited to one directory. What most people are asking for is the
>> ability to use a dynamic directory from an incoming flow file.
>> 
>> I think we might be trying to fit two different use-cases into one
>> processor which might not make sense.
>> 
>> Scenario #1... There is a directory that is constantly receiving new
>> data and has a significant amount of files, and I want to periodically
>> find new files. This is what the current processors are optimized for.
>> 
>> Scenario #2... There is a directory that is mostly static with a
>> moderate/small number of files, and at points in my flow I want to
>> dynamically perform a listing of this directory and retrieve the
>> files. This is more geared towards the mentality of running a
>> job/workflow.
>> 
>> 
>> 
>> 
>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler
>> <ottobackwards@gmail.com <ma...@gmail.com>>
>> wrote:
>> 
>> What if the changes where ‘on top of’ some base set of properties, like
>> directory?
>> Like a filter, where if present from the incoming file will have the LIST*
>> list only things that match a name or attribute?
>> 
>> 
>> 
>> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com
>> <ma...@gmail.com>) wrote:
>> 
>> Scott
>> 
>> This idea has come up a couple of times and there is definitely
>> something intriguing to it. Where I think this idea stalls out though
>> is in implementation.
>> 
>> While I agree that the other List* processors might similarly benefit
>> lets focus on ListFile. Today you tell ListFile what directory to
>> start looking for files in. It goes off scanning that directory for
>> hits and stores state about what it has already searched/seen. And it
>> is important to keep track of how much it has already scanned because
>> at times the search directory can be massive (100,000s of thousands or
>> more files and directories to scan for example).
>> 
>> In the proposed model the directory to be scanned could be provided
>> dynamically by looking at an attribute of an incoming flowfile (or
>> other criteria can be provided - not just the directory to scan). In
>> this case the ListFile processor goes on scanning against that now.
>> What about the previous directory (or directories) it was told to
>> scan? Does it still track those too? What if it starts scanning the
>> newly provided directory, hasn't finished pulling all the data or new
>> data is continually arriving, and it is told to switch to another
>> directory.
>> 
>> I think if those questions can get solid answers and someone invests
>> time in creating a PR then this could be pretty powerful. Would be
>> good to see a written description of the use case(s) for this too.
>> 
>> Thanks
>> Joe
>> 
>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com
>> <ma...@gmail.com>> wrote:
>> 
>> Hello Devs,
>> 
>> I would like to request a feature to a major processor, ListSFTP. But before
>> I do down the official road, I wanted to ask if anyone thought it was a
>> terrible idea or impossible, etc. The request is to add support for an
>> incoming relationship to the ListSFTP processor specifically, but I could
>> see it added to many of the commonly used head processes, such as ListFile.
>> I would envision functionality more like InvokeHTTP or ExecuteSQL, where an
>> incoming flow file could initiate the action, and the attributes in the
>> incoming flow file could be used to configure the processor actions. It's
>> the configuration aspect that most appeals to me, because it opens it up to
>> being centrally or dynamically configured.
>> 
>> Thanks,
>> 
>> Scott
>> 
>> 
>> 
>>

Re: ListSFTP incoming relationship

Posted by Bryan Bende <bb...@gmail.com>.

Scott,

You are correct that the overall discussion is about allowing incoming
flow files to ListSFTP.

However, the previous discussion on this thread highlighted that the
main reason ListSFTP currently doesn't allow incoming flow files is
because of challenges when storing state.

This led to the proposal of a new processor that allowed incoming flow
files, but did not store state, thus avoiding the challenges mentioned
above. If we were going to store state in this new processor, then
we'd be back to the exact same challenges.

Providing an option to turn on state also doesn't really help, because
if there is an option provided to users, then the option will be used,
and it needs to work when it is used.

If we can come up with something that stores state and works well for
all scenarios, then we aren't against it, we just need to handle the
challenges highlighted by Joe's original email.
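
To make the state concern concrete, here is a rough sketch in plain
Python. It is a simplified model and not how ListSFTP actually stores
state: StatefulLister, list_snapshot, and the directory names are
made up for illustration, and the only state kept is one last-seen
timestamp per directory.

```python
from typing import Dict, List, Tuple

# (filename, last-modified timestamp); a stand-in for a remote listing entry
Entry = Tuple[str, int]


class StatefulLister:
    """Hypothetical lister that remembers one timestamp per directory."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}

    def list_new(self, directory: str, entries: List[Entry]) -> List[str]:
        # Only report files newer than the newest one seen on the last run.
        last_seen = self.state.get(directory, -1)
        fresh = [name for name, ts in entries if ts > last_seen]
        if entries:
            self.state[directory] = max(ts for _, ts in entries)
        return fresh


def list_snapshot(entries: List[Entry]) -> List[str]:
    """Stateless variant: report everything present right now."""
    return [name for name, _ in entries]


# Dynamic directories: every distinct directory adds a state entry.
lister = StatefulLister()
for i in range(1000):
    lister.list_new(f"/incoming/job-{i}", [("data.csv", i)])
state_entries = len(lister.state)

# Out-of-order arrival: a file with an older timestamp is never reported.
late = StatefulLister()
first_run = late.list_new("/incoming", [("new.csv", 10)])
second_run = late.list_new("/incoming", [("new.csv", 10), ("late.csv", 5)])
```

In this model the state map grows by one entry per directory that is
ever listed, and late.csv is skipped on the second run because its
timestamp is older than the stored one. The stateless function has
neither problem, but it reports every file on every call.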

Regarding some of the other ideas...

The current output of ListSFTP already includes flow file attributes
for each listing that include the full path, filename, last update
time, owner, group, permissions, and file size. Were you thinking
of something different than that?

See the "Writes Attributes" section here:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html

Thanks,

Bryan



On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <al...@apache.org> wrote:
> Scott,
>
> I think there are two conversations going on here. You are finding the
> requirements for your specific use case, and that’s great. But I echo
> Bryan’s point that a community processor for this scenario should not store
> state at all. Sivaprasanna’s point that given dynamic directory input,
> storing state based on that can cause massive data ingestion problems still
> stands.
>
> For your specific use case, you can prototype (or possibly even get to a
> stable and robust-enough point) using ExecuteScript to model the behavior
> you need.
>
> In regards to the desired output format, I would suggest a few items:
>
> * Avro requires a schema to be defined, and this raises the barrier to use
> of the processor. Also, unless being sent to a processor that understands
> Avro, the result will need to be converted anyway using Record* processors.
> * If the output is individual flowfiles on a 1:1 basis, each should have as
> many attributes populated with the parsed information as possible (i.e.
> file.name, file.path, file.size, file.owner, file.permissions, etc.). This
> allows for easily-consumable and routable flowfiles.
> * If the output is a full directory listing, I would suggest `ls -al` type
> raw text output, or JSON (arbitrary human-readable and machine-readable
> format with many consuming/transforming processors).
>
>
> Andy LoPresto
> alopresto@apache.org
> alopresto.apache@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Mar 29, 2018, at 9:34 AM, scott <tc...@gmail.com> wrote:
>
> Sorry Bryan, but I disagree with you. Not storing state is NOT the main
> point of this new processor. The main point is to allow an incoming
> relationship flowfile to trigger the action, and allow variables to be used
> from the attributes therein.
>
> I agree that if the NiFi community deems it too risky to distribute this
> processor with state keeping optionally available, even if the default is to
> disable it, then so be it. If state is not included optionally, then how
> about making the output flowfile content include more than just the file
> names? Have it include last updated time along with the filename. If it
> searches recursively, you'll want to include the path to the file also.
> Maybe it would be best to output the results into a structured format, such
> as AVRO? Or, maybe it would just be best to output one flowfile per remote
> file found, and include updated time and fully qualified path as attributes?
>
> Scott
>
>
> On 03/29/2018 04:32 AM, Bryan Bende wrote:
>
> The main point of the new processor is to NOT store state so that it
> becomes more reasonable to allow incoming flow files.
>
> You could probably implement your own custom processor that does both
> because you can make assumptions about how you are going to use it, but if
> the NiFi community provides one then it needs to work well for all
> situations, such as dynamically listing hundreds of directories, which is
> problematic when state is involved.
>
> On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <si...@gmail.com>
> wrote:
>
> Should we really have to have an optional state saving functionality? If
> the user is unaware of the implications and proceed to store the state then
> what Andrew Grande mentioned will happen - possibilities of never ending
> stream of state information being stored. If we still go with the optional
> state management approach, documentation have to be clear in explaining the
> implications.
>
> Sivaprasanna
>
> On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:
>
> Okay. So, a new processor called "ScanSFTP", allow incoming relationship
> where the content of the flow file is replaced with the list of matching
> files from the remote directory, then the list is filtered by the usual
> regex parameters like today. Optional state information is kept to
> additionally filter the list of files older than the newest file
> observed during the last run. Does that sound okay to everyone? If so,
> what's the next step?
>
> Scott
>
>
> On 03/27/2018 06:21 PM, scott wrote:
>
> This is a great discussion, and appreciate the interest in my problem.
> I think there are workarounds if you decide not to store state, but
> I'd recommend keeping it. I think state should be kept optionally,
> even turned off by default. Several times I've had issues where the
> state has cause me to miss files, because files get moved into the
> source folder out of order, and I've wished I could turn the state
> feature off.
>
> In my current use-case, I would not be frequently, dynamically
> changing the source directory, though I can see the use-cases where it
> would be. In my current use-case, I want to use an external database
> table to control the configuration of all my flows. I do this by first
> reading the content of the table for this particular flow ID, then
> assign the result as attributes to the flowfile, essentially creating
> variables I can use throughout the flow to control its behavior. This
> works great with flows that initiate with HTTP or SQL, but not
> ListSFTP or ListFile.
>
> Scott
>
>
> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>
> I think Bryan’s point is a good one and when I first saw this
> question (and thought of the previous times it’s been asked), my
> initial response is to propose a second processor.
>
> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
> differently from ListSFTP — it does not maintain state, and performs
> a one-time tabulation/chronicling of the state of that directory at
> the given point in time.
>
> The responsibility to maintain and compare state across time is no
> longer a requirement. There could even be a setting in the processor
> to allow for “individual flowfile output” (i.e. act the same as
> ListSFTP and output one flowfile per item listed) or “summary
> flowfile output” where a single flowfile is generated containing the
> directory listing information for all the items there. (Another
> option is to output both on two different relationships).
>
> I think this would enable the types of workflows that users have
> asked about in the past without compromising the mechanism by which
> List* processors work and adding undue complexity to those processors.
>
> Absolutely crystal clear documentation (and a standard verb for the
> new processor family) would be necessary (not only because these
> processor solve different problems, but to avoid a million variants
> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
> provide a directory in an attribute to ListSFTP” mailing list
> questions).
>
>
> Andy LoPresto
> alopresto@apache.org <ma...@apache.org>
> /alopresto.apache@gmail.com <ma...@gmail.com>/
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com
> <ma...@gmail.com>> wrote:
>
> The key here is that ListXXX processor maintains state. A directory is part
> of such state. Allowing arbitrary directories via an expression would
> create never ending stream of new entries in the state storage, effectively
> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
> (for when state is stored in there).
>
> Maybe if we focus on thinking about assumptions and restrictions the
> processor should make to contain that risk...
>
> Andrew
>
> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com
> <ma...@gmail.com>> wrote:
>
> I'm not sure that would solve the problem because you'd still be
> limited to one directory. What most people are asking for is the
> ability to use a dynamic directory from an incoming flow file.
>
> I think we might be trying to fit two different use-cases into one
> processor which might not make sense.
>
> Scenario #1... There is a directory that is constantly receiving new
> data and has a significant amount of files, and I want to periodically
> find new files. This is what the current processors are optimized for.
>
> Scenario #2... There is a directory that is mostly static with a
> moderate/small number of files, and at points in my flow I want to
> dynamically perform a listing of this directory and retrieve the
> files. This is more geared towards the mentality of running a
> job/workflow.
>
>
>
>
> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler
> <ottobackwards@gmail.com <ma...@gmail.com>>
> wrote:
>
> What if the changes where ‘on top of’ some base set of properties, like
> directory?
> Like a filter, where if present from the incoming file will have the LIST*
> list only things that match a name or attribute?
>
>
>
> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com
> <ma...@gmail.com>) wrote:
>
> Scott
>
> This idea has come up a couple of times and there is definitely
> something intriguing to it. Where I think this idea stalls out though
> is in implementation.
>
> While I agree that the other List* processors might similarly benefit
> lets focus on ListFile. Today you tell ListFile what directory to
> start looking for files in. It goes off scanning that directory for
> hits and stores state about what it has already searched/seen. And it
> is important to keep track of how much it has already scanned because
> at times the search directory can be massive (100,000s of thousands or
> more files and directories to scan for example).
>
> In the proposed model the directory to be scanned could be provided
> dynamically by looking at an attribute of an incoming flowfile (or
> other criteria can be provided - not just the directory to scan). In
> this case the ListFile processor goes on scanning against that now.
> What about the previous directory (or directories) it was told to
> scan? Does it still track those too? What if it starts scanning the
> newly provided directory, hasn't finished pulling all the data or new
> data is continually arriving, and it is told to switch to another
> directory.
>
> I think if those questions can get solid answers and someone invests
> time in creating a PR then this could be pretty powerful. Would be
> good to see a written description of the use case(s) for this too.
>
> Thanks
> Joe
>
> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com
> <ma...@gmail.com>> wrote:
>
> Hello Devs,
>
> I would like to request a feature to a major processor, ListSFTP. But before
> I do down the official road, I wanted to ask if anyone thought it was a
> terrible idea or impossible, etc. The request is to add support for an
> incoming relationship to the ListSFTP processor specifically, but I could
> see it added to many of the commonly used head processes, such as ListFile.
> I would envision functionality more like InvokeHTTP or ExecuteSQL, where an
> incoming flow file could initiate the action, and the attributes in the
> incoming flow file could be used to configure the processor actions. It's
> the configuration aspect that most appeals to me, because it opens it up to
> being centrally or dynamically configured.
>
> Thanks,
>
> Scott
>
>
>
>

Re: ListSFTP incoming relationship

Posted by Andy LoPresto <al...@apache.org>.

Scott,

I think there are two conversations going on here. You are finding the requirements for your specific use case, and that’s great. But I echo Bryan’s point that a community processor for this scenario should not store state at all. Sivaprasanna’s point that given dynamic directory input, storing state based on that can cause massive data ingestion problems still stands.

For your specific use case, you can prototype (or possibly even get to a stable and robust-enough point) using ExecuteScript to model the behavior you need.

In regards to the desired output format, I would suggest a few items:

* Avro requires a schema to be defined, and this raises the barrier to use of the processor. Also, unless being sent to a processor that understands Avro, the result will need to be converted anyway using Record* processors.
* If the output is individual flowfiles on a 1:1 basis, each should have as many attributes populated with the parsed information as possible (i.e. file.name, file.path, file.size, file.owner, file.permissions, etc.). This allows for easily-consumable and routable flowfiles.
* If the output is a full directory listing, I would suggest `ls -al` type raw text output, or JSON (arbitrary human-readable and machine-readable format with many consuming/transforming processors).
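
As a rough illustration of the two output modes above, here is a sketch
in plain Python. It is not NiFi code: the RemoteFile class, the helper
functions, and the sample values are made up for the example, and the
attribute names are simply the ones suggested in this message.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass
class RemoteFile:
    """Hypothetical parsed listing entry; not a NiFi class."""
    name: str
    path: str
    size: int
    owner: str
    permissions: str


def to_attributes(entry: RemoteFile) -> Dict[str, str]:
    """Individual-flowfile mode: one attribute map per listed file."""
    return {
        "file.name": entry.name,
        "file.path": entry.path,
        "file.size": str(entry.size),
        "file.owner": entry.owner,
        "file.permissions": entry.permissions,
    }


def to_summary_json(entries: List[RemoteFile]) -> str:
    """Summary mode: a single JSON document describing the whole listing."""
    return json.dumps([asdict(e) for e in entries], sort_keys=True)


listing = [
    RemoteFile("a.csv", "/data/in", 120, "nifi", "rw-r--r--"),
    RemoteFile("b.csv", "/data/in", 64, "nifi", "rw-r-----"),
]
per_file = [to_attributes(e) for e in listing]
summary = to_summary_json(listing)
```

The per-file maps are the kind of thing that can be routed on directly,
while the JSON summary keeps the whole listing together in one piece of
content for downstream processors to transform.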


Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Mar 29, 2018, at 9:34 AM, scott <tc...@gmail.com> wrote:
> 
> Sorry Bryan, but I disagree with you. Not storing state is NOT the main point of this new processor. The main point is to allow an incoming relationship flowfile to trigger the action, and allow variables to be used from the attributes therein.
> 
> I agree that if the NiFi community deems it too risky to distribute this processor with state keeping optionally available, even if the default is to disable it, then so be it. If state is not included optionally, then how about making the output flowfile content include more than just the file names? Have it include last updated time along with the filename. If it searches recursively, you'll want to include the path to the file also. Maybe it would be best to output the results into a structured format, such as AVRO? Or, maybe it would just be best to output one flowfile per remote file found, and include updated time and fully qualified path as attributes?
> 
> Scott
> 
> 
> On 03/29/2018 04:32 AM, Bryan Bende wrote:
>> The main point of the new processor is to NOT store state so that it
>> becomes more reasonable to allow incoming flow files.
>> 
>> You could probably implement your own custom processor that does both
>> because you can make assumptions about how you are going to use it, but if
>> the NiFi community provides one then it needs to work well for all
>> situations, such as dynamically listing hundreds of directories, which is
>> problematic when state is involved.
>> 
>> On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <si...@gmail.com>
>> wrote:
>> 
>>> Should we really have to have an optional state saving functionality? If
>>> the user is unaware of the implications and proceed to store the state then
>>> what Andrew Grande mentioned will happen - possibilities of never ending
>>> stream of state information being stored. If we still go with the optional
>>> state management approach, documentation have to be clear in explaining the
>>> implications.
>>> 
>>> Sivaprasanna
>>> 
>>> On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:
>>> 
>>>> Okay. So, a new processor called "ScanSFTP", allow incoming relationship
>>>> where the content of the flow file is replaced with the list of matching
>>>> files from the remote directory, then the list is filtered by the usual
>>>> regex parameters like today. Optional state information is kept to
>>>> additionally filter the list of files older than the newest file
>>>> observed during the last run. Does that sound okay to everyone? If so,
>>>> what's the next step?
>>>> 
>>>> Scott
>>>> 
>>>> 
>>>> On 03/27/2018 06:21 PM, scott wrote:
>>>>> This is a great discussion, and appreciate the interest in my problem.
>>>>> I think there are workarounds if you decide not to store state, but
>>>>> I'd recommend keeping it. I think state should be kept optionally,
>>>>> even turned off by default. Several times I've had issues where the
>>>>> state has cause me to miss files, because files get moved into the
>>>>> source folder out of order, and I've wished I could turn the state
>>>>> feature off.
>>>>> 
>>>>> In my current use-case, I would not be frequently, dynamically
>>>>> changing the source directory, though I can see the use-cases where it
>>>>> would be. In my current use-case, I want to use an external database
>>>>> table to control the configuration of all my flows. I do this by first
>>>>> reading the content of the table for this particular flow ID, then
>>>>> assign the result as attributes to the flowfile, essentially creating
>>>>> variables I can use throughout the flow to control its behavior. This
>>>>> works great with flows that initiate with HTTP or SQL, but not
>>>>> ListSFTP or ListFile.
>>>>> 
>>>>> Scott
>>>>> 
>>>>> 
>>>>> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>>>>>> I think Bryan’s point is a good one and when I first saw this
>>>>>> question (and thought of the previous times it’s been asked), my
>>>>>> initial response is to propose a second processor.
>>>>>> 
>>>>>> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
>>>>>> differently from ListSFTP — it does not maintain state, and performs
>>>>>> a one-time tabulation/chronicling of the state of that directory at
>>>>>> the given point in time.
>>>>>> 
>>>>>> The responsibility to maintain and compare state across time is no
>>>>>> longer a requirement. There could even be a setting in the processor
>>>>>> to allow for “individual flowfile output” (i.e. act the same as
>>>>>> ListSFTP and output one flowfile per item listed) or “summary
>>>>>> flowfile output” where a single flowfile is generated containing the
>>>>>> directory listing information for all the items there. (Another
>>>>>> option is to output both on two different relationships).
>>>>>> 
>>>>>> I think this would enable the types of workflows that users have
>>>>>> asked about in the past without compromising the mechanism by which
>>>>>> List* processors work and adding undue complexity to those processors.
>>>>>> 
>>>>>> Absolutely crystal clear documentation (and a standard verb for the
>>>>>> new processor family) would be necessary (not only because these
>>>>>> processor solve different problems, but to avoid a million variants
>>>>>> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
>>>>>> provide a directory in an attribute to ListSFTP” mailing list
>>>>>> questions).
>>>>>> 
>>>>>> 
>>>>>> Andy LoPresto
>>>>>> alopresto@apache.org <ma...@apache.org>
>>>>>> /alopresto.apache@gmail.com <ma...@gmail.com>/
>>>>>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>>>>>> 
>>>>>>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com
>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> The key here is that ListXXX processor maintains state. A directory is part
>>>>>>> of such state. Allowing arbitrary directories via an expression would
>>>>>>> create never ending stream of new entries in the state storage, effectively
>>>>>>> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
>>>>>>> (for when state is stored in there).
>>>>>>> 
>>>>>>> Maybe if we focus on thinking about assumptions and restrictions the
>>>>>>> processor should make to contain that risk...
>>>>>>> 
>>>>>>> Andrew
>>>>>>> 
>>>>>>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com
>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>> 
>>>>>>>> I'm not sure that would solve the problem because you'd still be
>>>>>>>> limited to one directory. What most people are asking for is the
>>>>>>>> ability to use a dynamic directory from an incoming flow file.
>>>>>>>> 
>>>>>>>> I think we might be trying to fit two different use-cases into one
>>>>>>>> processor which might not make sense.
>>>>>>>> 
>>>>>>>> Scenario #1... There is a directory that is constantly receiving new
>>>>>>>> data and has a significant amount of files, and I want to periodically
>>>>>>>> find new files. This is what the current processors are optimized for.
>>>>>>>> 
>>>>>>>> Scenario #2... There is a directory that is mostly static with a
>>>>>>>> moderate/small number of files, and at points in my flow I want to
>>>>>>>> dynamically perform a listing of this directory and retrieve the
>>>>>>>> files. This is more geared towards the mentality of running a
>>>>>>>> job/workflow.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler
>>>>>>>> <ottobackwards@gmail.com <ma...@gmail.com>>
>>>>>>>> wrote:
>>>>>>>>> What if the changes where ‘on top of’ some base set of properties, like
>>>>>>>>> directory?
>>>>>>>>> Like a filter, where if present from the incoming file will have the LIST*
>>>>>>>>> list only things that match a name or attribute?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com
>>>>>>>>> <ma...@gmail.com>) wrote:
>>>>>>>>> 
>>>>>>>>> Scott
>>>>>>>>> 
>>>>>>>>> This idea has come up a couple of times and there is definitely
>>>>>>>>> something intriguing to it. Where I think this idea stalls out though
>>>>>>>>> is in implementation.
>>>>>>>>> 
>>>>>>>>> While I agree that the other List* processors might similarly benefit
>>>>>>>>> lets focus on ListFile. Today you tell ListFile what directory to
>>>>>>>>> start looking for files in. It goes off scanning that directory for
>>>>>>>>> hits and stores state about what it has already searched/seen. And it
>>>>>>>>> is important to keep track of how much it has already scanned because
>>>>>>>>> at times the search directory can be massive (100,000s of thousands or
>>>>>>>>> more files and directories to scan for example).
>>>>>>>>> 
>>>>>>>>> In the proposed model the directory to be scanned could be provided
>>>>>>>>> dynamically by looking at an attribute of an incoming flowfile (or
>>>>>>>>> other criteria can be provided - not just the directory to scan). In
>>>>>>>>> this case the ListFile processor goes on scanning against that now.
>>>>>>>>> What about the previous directory (or directories) it was told to
>>>>>>>>> scan? Does it still track those too? What if it starts scanning the
>>>>>>>>> newly provided directory, hasn't finished pulling all the data or new
>>>>>>>>> data is continually arriving, and it is told to switch to another
>>>>>>>>> directory.
>>>>>>>>> 
>>>>>>>>> I think if those questions can get solid answers and someone invests
>>>>>>>>> time in creating a PR then this could be pretty powerful. Would be
>>>>>>>>> good to see a written description of the use case(s) for this too.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Joe
>>>>>>>>> 
>>>>>>>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com
>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>> Hello Devs,
>>>>>>>>>> 
>>>>>>>>>> I would like to request a feature to a major processor, ListSFTP.
>>>> But
>>>>>>>>> before
>>>>>>>>>> I do down the official road, I wanted to ask if anyone thought it
>>>>>>>>>> was a
>>>>>>>>>> terrible idea or impossible, etc. The request is to add support
>>>>>>>>>> for an
>>>>>>>>>> incoming relationship to the ListSFTP processor specifically, but
>>> I
>>>>>>>> could
>>>>>>>>>> see it added to many of the commonly used head processes, such as
>>>>>>>>> ListFile.
>>>>>>>>>> I would envision functionality more like InvokeHTTP or
>>>>>>>>>> ExecuteSQL, where
>>>>>>>>> an
>>>>>>>>>> incoming flow file could initiate the action, and the attributes
>>>>>>>>>> in the
>>>>>>>>>> incoming flow file could be used to configure the processor
>>> actions.
>>>>>>>> It's
>>>>>>>>>> the configuration aspect that most appeals to me, because it
>>>>>>>>>> opens it up
>>>>>>>>> to
>>>>>>>>>> being centrally or dynamically configured.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> 
>>>>>>>>>> Scott
>>>>>>>>>> 
>>>> 
>

Re: ListSFTP incoming relationship

Posted by scott <tc...@gmail.com>.

Sorry Bryan, but I disagree with you. Not storing state is NOT the main 
point of this new processor. The main point is to allow an incoming 
relationship flowfile to trigger the action, and allow variables to be 
used from the attributes therein.

I agree that if the NiFi community deems it too risky to distribute this 
processor with state keeping optionally available, even if the default 
is to disable it, then so be it. If state is not included optionally, 
then how about making the output flowfile content include more than just 
the file names? Have it include last updated time along with the 
filename. If it searches recursively, you'll want to include the path to 
the file also. Maybe it would be best to output the results into a 
structured format, such as Avro? Or, maybe it would just be best to 
output one flowfile per remote file found, and include updated time and 
fully qualified path as attributes?

Scott
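
The "one flowfile per remote file" option above can be sketched in plain Java, with no NiFi dependencies. This is only an illustration of the idea, not the design of any existing processor: the class RemoteFileEntry and the attribute keys it emits are assumptions chosen for the example.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical value type: one entry per remote file found by a listing.
final class RemoteFileEntry {
    final String directory;         // remote directory that contains the file
    final String filename;          // file name without the directory part
    final long lastModifiedMillis;  // last-modified time in epoch milliseconds

    RemoteFileEntry(String directory, String filename, long lastModifiedMillis) {
        this.directory = directory;
        this.filename = filename;
        this.lastModifiedMillis = lastModifiedMillis;
    }

    // Fully qualified path; needed to tell files apart once the search is recursive.
    String absolutePath() {
        return directory.endsWith("/") ? directory + filename : directory + "/" + filename;
    }

    // Attributes for a single outgoing flowfile. The keys are illustrative assumptions.
    Map<String, String> toAttributes() {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("filename", filename);
        attrs.put("path", directory);
        attrs.put("absolute.path", absolutePath());
        attrs.put("file.lastModifiedTime", Instant.ofEpochMilli(lastModifiedMillis).toString());
        return attrs;
    }
}
```

Carrying the timestamp and the full path as attributes would let a downstream step decide what is new without the listing processor itself keeping state, which is the trade-off under discussion here.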


On 03/29/2018 04:32 AM, Bryan Bende wrote:
> The main point of the new processor is to NOT store state so that it
> becomes more reasonable to allow incoming flow files.
>
> You could probably implement your own custom processor that does both
> because you can make assumptions about how you are going to use it, but if
> the NiFi community provides one then it needs to work well for all
> situations, such as dynamically listing hundreds of directories, which is
> problematic when state is involved.
>
> On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <si...@gmail.com>
> wrote:
>
>> Should we really have to have an optional state saving functionality? If
>> the user is unaware of the implications and proceed to store the state then
>> what Andrew Grande mentioned will happen - possibilities of never ending
>> stream of state information being stored. If we still go with the optional
>> state management approach, documentation have to be clear in explaining the
>> implications.
>>
>> Sivaprasanna
>>
>> On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:
>>
>>> Okay. So, a new processor called "ScanSFTP", allow incoming relationship
>>> where the content of the flow file is replaced with the list of matching
>>> files from the remote directory, then the list is filtered by the usual
>>> regex parameters like today. Optional state information is kept to
>>> additionally filter the list of files older than the newest file
>>> observed during the last run. Does that sound okay to everyone? If so,
>>> what's the next step?
>>>
>>> Scott
>>>
>>>
>>> On 03/27/2018 06:21 PM, scott wrote:
>>>> This is a great discussion, and appreciate the interest in my problem.
>>>> I think there are workarounds if you decide not to store state, but
>>>> I'd recommend keeping it. I think state should be kept optionally,
>>>> even turned off by default. Several times I've had issues where the
>>>> state has cause me to miss files, because files get moved into the
>>>> source folder out of order, and I've wished I could turn the state
>>>> feature off.
>>>>
>>>> In my current use-case, I would not be frequently, dynamically
>>>> changing the source directory, though I can see the use-cases where it
>>>> would be. In my current use-case, I want to use an external database
>>>> table to control the configuration of all my flows. I do this by first
>>>> reading the content of the table for this particular flow ID, then
>>>> assign the result as attributes to the flowfile, essentially creating
>>>> variables I can use throughout the flow to control its behavior. This
>>>> works great with flows that initiate with HTTP or SQL, but not
>>>> ListSFTP or ListFile.
>>>>
>>>> Scott
>>>>
>>>>
>>>> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>>>>> I think Bryan’s point is a good one and when I first saw this
>>>>> question (and thought of the previous times it’s been asked), my
>>>>> initial response is to propose a second processor.
>>>>>
>>>>> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
>>>>> differently from ListSFTP — it does not maintain state, and performs
>>>>> a one-time tabulation/chronicling of the state of that directory at
>>>>> the given point in time.
>>>>>
>>>>> The responsibility to maintain and compare state across time is no
>>>>> longer a requirement. There could even be a setting in the processor
>>>>> to allow for “individual flowfile output” (i.e. act the same as
>>>>> ListSFTP and output one flowfile per item listed) or “summary
>>>>> flowfile output” where a single flowfile is generated containing the
>>>>> directory listing information for all the items there. (Another
>>>>> option is to output both on two different relationships).
>>>>>
>>>>> I think this would enable the types of workflows that users have
>>>>> asked about in the past without compromising the mechanism by which
>>>>> List* processors work and adding undue complexity to those processors.
>>>>>
>>>>> Absolutely crystal clear documentation (and a standard verb for the
>>>>> new processor family) would be necessary (not only because these
>>>>> processor solve different problems, but to avoid a million variants
>>>>> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
>>>>> provide a directory in an attribute to ListSFTP” mailing list
>>>>> questions).
>>>>>
>>>>>
>>>>> Andy LoPresto
>>>>> alopresto@apache.org <ma...@apache.org>
>>>>> /alopresto.apache@gmail.com <ma...@gmail.com>/
>>>>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>>>>>
>>>>>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com
>>>>>> <ma...@gmail.com>> wrote:
>>>>>>
>>>>>> The key here is that ListXXX processor maintains state. A directory
>>>>>> is part
>>>>>> of such state. Allowing arbitrary directories via an expression would
>>>>>> create never ending stream of new entries in the state storage,
>>>>>> effectively
>>>>>> engineering a distributed DoS attack on the NiFi node or shared ZK
>>>>>> quorum
>>>>>> (for when state is stored in there).
>>>>>>
>>>>>> Maybe if we focus on thinking about assumptions and restrictions the
>>>>>> processor should make to contain that risk...
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com
>>>>>> <ma...@gmail.com>> wrote:
>>>>>>
>>>>>>> I'm not sure that would solve the problem because you'd still be
>>>>>>> limited to one directory. What most people are asking for is the
>>>>>>> ability to use a dynamic directory from an incoming flow file.
>>>>>>>
>>>>>>> I think we might be trying to fit two different use-cases into one
>>>>>>> processor which might not make sense.
>>>>>>>
>>>>>>> Scenario #1... There is a directory that is constantly receiving new
>>>>>>> data and has a significant amount of files, and I want to
>> periodically
>>>>>>> find new files. This is what the current processors are optimized
>> for.
>>>>>>> Scenario #2... There is a directory that is mostly static with a
>>>>>>> moderate/small number of files, and at points in my flow I want to
>>>>>>> dynamically perform a listing of this directory and retrieve the
>>>>>>> files. This is more geared towards the mentality of running a
>>>>>>> job/workflow.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler
>>>>>>> <ottobackwards@gmail.com <ma...@gmail.com>>
>>>>>>> wrote:
>>>>>>>> What if the changes where ‘on top of’ some base set of properties,
>>>>>>>> like
>>>>>>>> directory?
>>>>>>>> Like a filter, where if present from the incoming file will have
>> the
>>>>>>> LIST*
>>>>>>>> list only things
>>>>>>>> that match a name or attribute?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com
>>>>>>>> <ma...@gmail.com>) wrote:
>>>>>>>>
>>>>>>>> Scott
>>>>>>>>
>>>>>>>> This idea has come up a couple of times and there is definitely
>>>>>>>> something intriguing to it. Where I think this idea stalls out
>> though
>>>>>>>> is in implementation.
>>>>>>>>
>>>>>>>> While I agree that the other List* processors might similarly
>> benefit
>>>>>>>> lets focus on ListFile. Today you tell ListFile what directory to
>>>>>>>> start looking for files in. It goes off scanning that directory for
>>>>>>>> hits and stores state about what it has already searched/seen. And
>> it
>>>>>>>> is important to keep track of how much it has already scanned
>> because
>>>>>>>> at times the search directory can be massive (100,000s of thousands
>>> or
>>>>>>>> more files and directories to scan for example).
>>>>>>>>
>>>>>>>> In the proposed model the directory to be scanned could be provided
>>>>>>>> dynamically by looking at an attribute of an incoming flowfile (or
>>>>>>>> other criteria can be provided - not just the directory to scan).
>> In
>>>>>>>> this case the ListFile processor goes on scanning against that now.
>>>>>>>> What about the previous directory (or directories) it was told to
>>>>>>>> scan? Does it still track those too? What if it starts scanning the
>>>>>>>> newly provided directory, hasn't finished pulling all the data or
>> new
>>>>>>>> data is continually arriving, and it is told to switch to another
>>>>>>>> directory.
>>>>>>>>
>>>>>>>> I think if those questions can get solid answers and someone
>> invests
>>>>>>>> time in creating a PR then this could be pretty powerful. Would be
>>>>>>>> good to see a written description of the use case(s) for this too.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Joe
>>>>>>>>
>>>>>>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com
>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>> Hello Devs,
>>>>>>>>>
>>>>>>>>> I would like to request a feature to a major processor, ListSFTP.
>>> But
>>>>>>>> before
>>>>>>>>> I do down the official road, I wanted to ask if anyone thought it
>>>>>>>>> was a
>>>>>>>>> terrible idea or impossible, etc. The request is to add support
>>>>>>>>> for an
>>>>>>>>> incoming relationship to the ListSFTP processor specifically, but
>> I
>>>>>>> could
>>>>>>>>> see it added to many of the commonly used head processes, such as
>>>>>>>> ListFile.
>>>>>>>>> I would envision functionality more like InvokeHTTP or
>>>>>>>>> ExecuteSQL, where
>>>>>>>> an
>>>>>>>>> incoming flow file could initiate the action, and the attributes
>>>>>>>>> in the
>>>>>>>>> incoming flow file could be used to configure the processor
>> actions.
>>>>>>> It's
>>>>>>>>> the configuration aspect that most appeals to me, because it
>>>>>>>>> opens it up
>>>>>>>> to
>>>>>>>>> being centrally or dynamically configured.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Scott
>>>>>>>>>
>>>

Re: ListSFTP incoming relationship

Posted by Bryan Bende <bb...@gmail.com>.

The main point of the new processor is to NOT store state so that it
becomes more reasonable to allow incoming flow files.

You could probably implement your own custom processor that does both
because you can make assumptions about how you are going to use it, but if
the NiFi community provides one then it needs to work well for all
situations, such as dynamically listing hundreds of directories, which is
problematic when state is involved.
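
Why state becomes a problem with dynamic directories can be seen in a small, self-contained Java sketch. It assumes, purely for illustration, that a stateful lister keeps one "newest timestamp seen" entry per directory; the class PerDirectoryState is hypothetical and does not reflect how the List* processors actually store state.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: listing state keyed by directory.
final class PerDirectoryState {
    private final Map<String, Long> newestSeenByDirectory = new HashMap<>();

    // Remember the newest timestamp seen for a directory.
    // Every distinct directory adds one more entry that is never removed.
    void record(String directory, long newestTimestampMillis) {
        Long previous = newestSeenByDirectory.get(directory);
        if (previous == null || newestTimestampMillis > previous) {
            newestSeenByDirectory.put(directory, newestTimestampMillis);
        }
    }

    // Number of directories currently tracked in state.
    int trackedDirectories() {
        return newestSeenByDirectory.size();
    }
}
```

With a fixed directory the map holds a single entry. If the directory comes from an incoming flow file attribute, the map grows by one entry for every distinct value ever seen, and nothing in this naive model ever shrinks it.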

On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <si...@gmail.com>
wrote:

> Should we really have to have an optional state saving functionality? If
> the user is unaware of the implications and proceed to store the state then
> what Andrew Grande mentioned will happen - possibilities of never ending
> stream of state information being stored. If we still go with the optional
> state management approach, documentation have to be clear in explaining the
> implications.
>
> Sivaprasanna
>
> On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:
>
> > Okay. So, a new processor called "ScanSFTP", allow incoming relationship
> > where the content of the flow file is replaced with the list of matching
> > files from the remote directory, then the list is filtered by the usual
> > regex parameters like today. Optional state information is kept to
> > additionally filter the list of files older than the newest file
> > observed during the last run. Does that sound okay to everyone? If so,
> > what's the next step?
> >
> > Scott
> >
> >
> > On 03/27/2018 06:21 PM, scott wrote:
> > >
> > > This is a great discussion, and appreciate the interest in my problem.
> > > I think there are workarounds if you decide not to store state, but
> > > I'd recommend keeping it. I think state should be kept optionally,
> > > even turned off by default. Several times I've had issues where the
> > > state has cause me to miss files, because files get moved into the
> > > source folder out of order, and I've wished I could turn the state
> > > feature off.
> > >
> > > In my current use-case, I would not be frequently, dynamically
> > > changing the source directory, though I can see the use-cases where it
> > > would be. In my current use-case, I want to use an external database
> > > table to control the configuration of all my flows. I do this by first
> > > reading the content of the table for this particular flow ID, then
> > > assign the result as attributes to the flowfile, essentially creating
> > > variables I can use throughout the flow to control its behavior. This
> > > works great with flows that initiate with HTTP or SQL, but not
> > > ListSFTP or ListFile.
> > >
> > > Scott
> > >
> > >
> > > On 03/27/2018 02:05 PM, Andy LoPresto wrote:
> > >> I think Bryan’s point is a good one and when I first saw this
> > >> question (and thought of the previous times it’s been asked), my
> > >> initial response is to propose a second processor.
> > >>
> > >> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
> > >> differently from ListSFTP — it does not maintain state, and performs
> > >> a one-time tabulation/chronicling of the state of that directory at
> > >> the given point in time.
> > >>
> > >> The responsibility to maintain and compare state across time is no
> > >> longer a requirement. There could even be a setting in the processor
> > >> to allow for “individual flowfile output” (i.e. act the same as
> > >> ListSFTP and output one flowfile per item listed) or “summary
> > >> flowfile output” where a single flowfile is generated containing the
> > >> directory listing information for all the items there. (Another
> > >> option is to output both on two different relationships).
> > >>
> > >> I think this would enable the types of workflows that users have
> > >> asked about in the past without compromising the mechanism by which
> > >> List* processors work and adding undue complexity to those processors.
> > >>
> > >> Absolutely crystal clear documentation (and a standard verb for the
> > >> new processor family) would be necessary (not only because these
> > >> processor solve different problems, but to avoid a million variants
> > >> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
> > >> provide a directory in an attribute to ListSFTP” mailing list
> > >> questions).
> > >>
> > >>
> > >> Andy LoPresto
> > >> alopresto@apache.org <ma...@apache.org>
> > >> /alopresto.apache@gmail.com <ma...@gmail.com>/
> > >> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> > >>
> > >>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com
> > >>> <ma...@gmail.com>> wrote:
> > >>>
> > >>> The key here is that ListXXX processor maintains state. A directory
> > >>> is part
> > >>> of such state. Allowing arbitrary directories via an expression would
> > >>> create never ending stream of new entries in the state storage,
> > >>> effectively
> > >>> engineering a distributed DoS attack on the NiFi node or shared ZK
> > >>> quorum
> > >>> (for when state is stored in there).
> > >>>
> > >>> Maybe if we focus on thinking about assumptions and restrictions the
> > >>> processor should make to contain that risk...
> > >>>
> > >>> Andrew
> > >>>
> > >>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com
> > >>> <ma...@gmail.com>> wrote:
> > >>>
> > >>>> I'm not sure that would solve the problem because you'd still be
> > >>>> limited to one directory. What most people are asking for is the
> > >>>> ability to use a dynamic directory from an incoming flow file.
> > >>>>
> > >>>> I think we might be trying to fit two different use-cases into one
> > >>>> processor which might not make sense.
> > >>>>
> > >>>> Scenario #1... There is a directory that is constantly receiving new
> > >>>> data and has a significant amount of files, and I want to
> periodically
> > >>>> find new files. This is what the current processors are optimized
> for.
> > >>>>
> > >>>> Scenario #2... There is a directory that is mostly static with a
> > >>>> moderate/small number of files, and at points in my flow I want to
> > >>>> dynamically perform a listing of this directory and retrieve the
> > >>>> files. This is more geared towards the mentality of running a
> > >>>> job/workflow.
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler
> > >>>> <ottobackwards@gmail.com <ma...@gmail.com>>
> > >>>> wrote:
> > >>>>> What if the changes where ‘on top of’ some base set of properties,
> > >>>>> like
> > >>>>> directory?
> > >>>>> Like a filter, where if present from the incoming file will have
> the
> > >>>> LIST*
> > >>>>> list only things
> > >>>>> that match a name or attribute?
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com
> > >>>>> <ma...@gmail.com>) wrote:
> > >>>>>
> > >>>>> Scott
> > >>>>>
> > >>>>> This idea has come up a couple of times and there is definitely
> > >>>>> something intriguing to it. Where I think this idea stalls out
> though
> > >>>>> is in implementation.
> > >>>>>
> > >>>>> While I agree that the other List* processors might similarly
> benefit
> > >>>>> lets focus on ListFile. Today you tell ListFile what directory to
> > >>>>> start looking for files in. It goes off scanning that directory for
> > >>>>> hits and stores state about what it has already searched/seen. And
> it
> > >>>>> is important to keep track of how much it has already scanned
> because
> > >>>>> at times the search directory can be massive (100,000s of thousands
> > or
> > >>>>> more files and directories to scan for example).
> > >>>>>
> > >>>>> In the proposed model the directory to be scanned could be provided
> > >>>>> dynamically by looking at an attribute of an incoming flowfile (or
> > >>>>> other criteria can be provided - not just the directory to scan).
> In
> > >>>>> this case the ListFile processor goes on scanning against that now.
> > >>>>> What about the previous directory (or directories) it was told to
> > >>>>> scan? Does it still track those too? What if it starts scanning the
> > >>>>> newly provided directory, hasn't finished pulling all the data or
> new
> > >>>>> data is continually arriving, and it is told to switch to another
> > >>>>> directory.
> > >>>>>
> > >>>>> I think if those questions can get solid answers and someone
> invests
> > >>>>> time in creating a PR then this could be pretty powerful. Would be
> > >>>>> good to see a written description of the use case(s) for this too.
> > >>>>>
> > >>>>> Thanks
> > >>>>> Joe
> > >>>>>
> > >>>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com
> > >>>>> <ma...@gmail.com>> wrote:
> > >>>>>> Hello Devs,
> > >>>>>>
> > >>>>>> I would like to request a feature to a major processor, ListSFTP.
> > But
> > >>>>> before
> > >>>>>> I do down the official road, I wanted to ask if anyone thought it
> > >>>>>> was a
> > >>>>>> terrible idea or impossible, etc. The request is to add support
> > >>>>>> for an
> > >>>>>> incoming relationship to the ListSFTP processor specifically, but
> I
> > >>>> could
> > >>>>>> see it added to many of the commonly used head processes, such as
> > >>>>> ListFile.
> > >>>>>> I would envision functionality more like InvokeHTTP or
> > >>>>>> ExecuteSQL, where
> > >>>>> an
> > >>>>>> incoming flow file could initiate the action, and the attributes
> > >>>>>> in the
> > >>>>>> incoming flow file could be used to configure the processor
> actions.
> > >>>> It's
> > >>>>>> the configuration aspect that most appeals to me, because it
> > >>>>>> opens it up
> > >>>>> to
> > >>>>>> being centrally or dynamically configured.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>>
> > >>>>>> Scott
> > >>>>>>
> > >>>>
> > >>
> > >
> >
> >
>

Re: ListSFTP incoming relationship

Posted by Sivaprasanna <si...@gmail.com>.

Do we really need optional state-saving functionality? If the user is
unaware of the implications and proceeds to store the state, then what
Andrew Grande mentioned will happen: a possibly never-ending stream of
state information being stored. If we still go with the optional state
management approach, the documentation has to be clear in explaining the
implications.

Sivaprasanna
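
One possible restriction that would contain the risk described above is to cap how many directories may be tracked at once. The Java sketch below is an assumption for discussion, not something proposed in this thread in this form: BoundedDirectoryState and its maxDirectories limit are hypothetical, and it relies only on the standard LinkedHashMap eviction hook.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: per-directory state with a hard upper bound on entries.
final class BoundedDirectoryState {
    private final Map<String, Long> newestSeen;

    BoundedDirectoryState(final int maxDirectories) {
        // Access-ordered map that drops the least recently used directory
        // as soon as the cap is exceeded.
        this.newestSeen = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > maxDirectories;
            }
        };
    }

    // Store the newest timestamp seen for a directory.
    void record(String directory, long newestTimestampMillis) {
        newestSeen.put(directory, newestTimestampMillis);
    }

    boolean isTracked(String directory) {
        return newestSeen.containsKey(directory);
    }

    int size() {
        return newestSeen.size();
    }
}
```

The cost of this choice is that an evicted directory looks brand new the next time it is listed, so its files would be emitted again. That is the kind of implication the documentation would have to spell out.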

On Thu, 29 Mar 2018 at 9:28 AM, scott <tc...@gmail.com> wrote:

> Okay. So, a new processor called "ScanSFTP", allow incoming relationship
> where the content of the flow file is replaced with the list of matching
> files from the remote directory, then the list is filtered by the usual
> regex parameters like today. Optional state information is kept to
> additionally filter the list of files older than the newest file
> observed during the last run. Does that sound okay to everyone? If so,
> what's the next step?
>
> Scott
>
>
> On 03/27/2018 06:21 PM, scott wrote:
> >
> > This is a great discussion, and appreciate the interest in my problem.
> > I think there are workarounds if you decide not to store state, but
> > I'd recommend keeping it. I think state should be kept optionally,
> > even turned off by default. Several times I've had issues where the
> > state has cause me to miss files, because files get moved into the
> > source folder out of order, and I've wished I could turn the state
> > feature off.
> >
> > In my current use-case, I would not be frequently, dynamically
> > changing the source directory, though I can see the use-cases where it
> > would be. In my current use-case, I want to use an external database
> > table to control the configuration of all my flows. I do this by first
> > reading the content of the table for this particular flow ID, then
> > assign the result as attributes to the flowfile, essentially creating
> > variables I can use throughout the flow to control its behavior. This
> > works great with flows that initiate with HTTP or SQL, but not
> > ListSFTP or ListFile.
> >
> > Scott
> >
> >
> > On 03/27/2018 02:05 PM, Andy LoPresto wrote:
> >> I think Bryan’s point is a good one and when I first saw this
> >> question (and thought of the previous times it’s been asked), my
> >> initial response is to propose a second processor.
> >>
> >> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
> >> differently from ListSFTP — it does not maintain state, and performs
> >> a one-time tabulation/chronicling of the state of that directory at
> >> the given point in time.
> >>
> >> The responsibility to maintain and compare state across time is no
> >> longer a requirement. There could even be a setting in the processor
> >> to allow for “individual flowfile output” (i.e. act the same as
> >> ListSFTP and output one flowfile per item listed) or “summary
> >> flowfile output” where a single flowfile is generated containing the
> >> directory listing information for all the items there. (Another
> >> option is to output both on two different relationships).
> >>
> >> I think this would enable the types of workflows that users have
> >> asked about in the past without compromising the mechanism by which
> >> List* processors work and adding undue complexity to those processors.
> >>
> >> Absolutely crystal clear documentation (and a standard verb for the
> >> new processor family) would be necessary (not only because these
> >> processor solve different problems, but to avoid a million variants
> >> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
> >> provide a directory in an attribute to ListSFTP” mailing list
> >> questions).
> >>
> >>
> >> Andy LoPresto
> >> alopresto@apache.org <ma...@apache.org>
> >> /alopresto.apache@gmail.com <ma...@gmail.com>/
> >> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> >>
> >>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com
> >>> <ma...@gmail.com>> wrote:
> >>>
> >>> The key here is that ListXXX processor maintains state. A directory
> >>> is part
> >>> of such state. Allowing arbitrary directories via an expression would
> >>> create never ending stream of new entries in the state storage,
> >>> effectively
> >>> engineering a distributed DoS attack on the NiFi node or shared ZK
> >>> quorum
> >>> (for when state is stored in there).
> >>>
> >>> Maybe if we focus on thinking about assumptions and restrictions the
> >>> processor should make to contain that risk...
> >>>
> >>> Andrew
> >>>
> >>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com
> >>> <ma...@gmail.com>> wrote:
> >>>
> >>>> I'm not sure that would solve the problem because you'd still be
> >>>> limited to one directory. What most people are asking for is the
> >>>> ability to use a dynamic directory from an incoming flow file.
> >>>>
> >>>> I think we might be trying to fit two different use-cases into one
> >>>> processor which might not make sense.
> >>>>
> >>>> Scenario #1... There is a directory that is constantly receiving new
> >>>> data and has a significant amount of files, and I want to periodically
> >>>> find new files. This is what the current processors are optimized for.
> >>>>
> >>>> Scenario #2... There is a directory that is mostly static with a
> >>>> moderate/small number of files, and at points in my flow I want to
> >>>> dynamically perform a listing of this directory and retrieve the
> >>>> files. This is more geared towards the mentality of running a
> >>>> job/workflow.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler
> >>>> <ottobackwards@gmail.com <ma...@gmail.com>>
> >>>> wrote:
> >>>>> What if the changes where ‘on top of’ some base set of properties,
> >>>>> like
> >>>>> directory?
> >>>>> Like a filter, where if present from the incoming file will have the
> >>>> LIST*
> >>>>> list only things
> >>>>> that match a name or attribute?
> >>>>>
> >>>>>
> >>>>>
> >>>>> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com
> >>>>> <ma...@gmail.com>) wrote:
> >>>>>
> >>>>> Scott
> >>>>>
> >>>>> This idea has come up a couple of times and there is definitely
> >>>>> something intriguing to it. Where I think this idea stalls out though
> >>>>> is in implementation.
> >>>>>
> >>>>> While I agree that the other List* processors might similarly benefit
> >>>>> lets focus on ListFile. Today you tell ListFile what directory to
> >>>>> start looking for files in. It goes off scanning that directory for
> >>>>> hits and stores state about what it has already searched/seen. And it
> >>>>> is important to keep track of how much it has already scanned because
> >>>>> at times the search directory can be massive (100,000s of thousands
> or
> >>>>> more files and directories to scan for example).
> >>>>>
> >>>>> In the proposed model the directory to be scanned could be provided
> >>>>> dynamically by looking at an attribute of an incoming flowfile (or
> >>>>> other criteria can be provided - not just the directory to scan). In
> >>>>> this case the ListFile processor goes on scanning against that now.
> >>>>> What about the previous directory (or directories) it was told to
> >>>>> scan? Does it still track those too? What if it starts scanning the
> >>>>> newly provided directory, hasn't finished pulling all the data or new
> >>>>> data is continually arriving, and it is told to switch to another
> >>>>> directory.
> >>>>>
> >>>>> I think if those questions can get solid answers and someone invests
> >>>>> time in creating a PR then this could be pretty powerful. Would be
> >>>>> good to see a written description of the use case(s) for this too.
> >>>>>
> >>>>> Thanks
> >>>>> Joe
> >>>>>
> >>>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com
> >>>>> <ma...@gmail.com>> wrote:
> >>>>>> Hello Devs,
> >>>>>>
> >>>>>> I would like to request a feature to a major processor, ListSFTP.
> But
> >>>>> before
> >>>>>> I do down the official road, I wanted to ask if anyone thought it
> >>>>>> was a
> >>>>>> terrible idea or impossible, etc. The request is to add support
> >>>>>> for an
> >>>>>> incoming relationship to the ListSFTP processor specifically, but I
> >>>> could
> >>>>>> see it added to many of the commonly used head processes, such as
> >>>>> ListFile.
> >>>>>> I would envision functionality more like InvokeHTTP or
> >>>>>> ExecuteSQL, where
> >>>>> an
> >>>>>> incoming flow file could initiate the action, and the attributes
> >>>>>> in the
> >>>>>> incoming flow file could be used to configure the processor actions.
> >>>> It's
> >>>>>> the configuration aspect that most appeals to me, because it
> >>>>>> opens it up
> >>>>> to
> >>>>>> being centrally or dynamically configured.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Scott
> >>>>>>
> >>>>
> >>
> >
>
>

Re: ListSFTP incoming relationship

Posted by scott <tc...@gmail.com>.

Okay. So, the proposal is a new processor called "ScanSFTP" that allows 
an incoming relationship. The content of the incoming flow file is 
replaced with the list of matching files from the remote directory, 
filtered by the usual regex parameters as it is today. Optionally, state 
is kept so that the list can additionally be filtered to exclude files 
older than the newest file observed during the last run. Does that sound 
okay to everyone? If so, what's the next step?
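
To make that concrete, here is a rough sketch of the behaviour in plain
Python rather than NiFi code. Everything named here (RemoteFile, scan,
last_seen) is invented for illustration and is not an existing NiFi API,
and the remote listing is passed in as a list so the example stays
self-contained.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class RemoteFile:
    # Hypothetical stand-in for one entry of a remote directory listing.
    name: str
    modified: int  # last-modified time, e.g. epoch seconds


def scan(entries: Iterable[RemoteFile],
         file_filter: str,
         last_seen: Optional[int] = None) -> Tuple[List[RemoteFile], Optional[int]]:
    """Filter a listing by regex and, optionally, by the newest timestamp
    observed on the previous run. Returns (matches, new_last_seen)."""
    pattern = re.compile(file_filter)
    matches = [e for e in entries if pattern.fullmatch(e.name)]
    if last_seen is not None:
        # Optional state: keep only files newer than the last run's newest file.
        matches = [e for e in matches if e.modified > last_seen]
    # If nothing matched, the previous timestamp is carried forward unchanged.
    newest = max((e.modified for e in matches), default=last_seen)
    return matches, newest
```

With state turned off, the caller would pass last_seen=None on every run
and ignore the second return value.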

Scott


On 03/27/2018 06:21 PM, scott wrote:
>
> This is a great discussion, and appreciate the interest in my problem. 
> I think there are workarounds if you decide not to store state, but 
> I'd recommend keeping it. I think state should be kept optionally, 
> even turned off by default. Several times I've had issues where the 
> state has cause me to miss files, because files get moved into the 
> source folder out of order, and I've wished I could turn the state 
> feature off.
>
> In my current use-case, I would not be frequently, dynamically 
> changing the source directory, though I can see the use-cases where it 
> would be. In my current use-case, I want to use an external database 
> table to control the configuration of all my flows. I do this by first 
> reading the content of the table for this particular flow ID, then 
> assign the result as attributes to the flowfile, essentially creating 
> variables I can use throughout the flow to control its behavior. This 
> works great with flows that initiate with HTTP or SQL, but not 
> ListSFTP or ListFile.
>
> Scott
>
>
> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>> I think Bryan’s point is a good one and when I first saw this 
>> question (and thought of the previous times it’s been asked), my 
>> initial response is to propose a second processor.
>>
>> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates 
>> differently from ListSFTP — it does not maintain state, and performs 
>> a one-time tabulation/chronicling of the state of that directory at 
>> the given point in time.
>>
>> The responsibility to maintain and compare state across time is no 
>> longer a requirement. There could even be a setting in the processor 
>> to allow for “individual flowfile output” (i.e. act the same as 
>> ListSFTP and output one flowfile per item listed) or “summary 
>> flowfile output” where a single flowfile is generated containing the 
>> directory listing information for all the items there. (Another 
>> option is to output both on two different relationships).
>>
>> I think this would enable the types of workflows that users have 
>> asked about in the past without compromising the mechanism by which 
>> List* processors work and adding undue complexity to those processors.
>>
>> Absolutely crystal clear documentation (and a standard verb for the 
>> new processor family) would be necessary (not only because these 
>> processor solve different problems, but to avoid a million variants 
>> of “I used ScanSFTP processor and it’s not tracking state”/“How do I 
>> provide a directory in an attribute to ListSFTP” mailing list 
>> questions).
>>
>>
>> Andy LoPresto
>> alopresto@apache.org
>> alopresto.apache@gmail.com
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>>
>>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com> wrote:
>>>
>>> The key here is that ListXXX processor maintains state. A directory is part
>>> of such state. Allowing arbitrary directories via an expression would
>>> create never ending stream of new entries in the state storage, effectively
>>> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
>>> (for when state is stored in there).
>>>
>>> Maybe if we focus on thinking about assumptions and restrictions the
>>> processor should make to contain that risk...
>>>
>>> Andrew
>>>
>>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com> wrote:
>>>
>>>> I'm not sure that would solve the problem because you'd still be
>>>> limited to one directory. What most people are asking for is the
>>>> ability to use a dynamic directory from an incoming flow file.
>>>>
>>>> I think we might be trying to fit two different use-cases into one
>>>> processor which might not make sense.
>>>>
>>>> Scenario #1... There is a directory that is constantly receiving new
>>>> data and has a significant amount of files, and I want to periodically
>>>> find new files. This is what the current processors are optimized for.
>>>>
>>>> Scenario #2... There is a directory that is mostly static with a
>>>> moderate/small number of files, and at points in my flow I want to
>>>> dynamically perform a listing of this directory and retrieve the
>>>> files. This is more geared towards the mentality of running a
>>>> job/workflow.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <ottobackwards@gmail.com>
>>>> wrote:
>>>>> What if the changes where ‘on top of’ some base set of properties, like
>>>>> directory?
>>>>> Like a filter, where if present from the incoming file will have the
>>>>> LIST* list only things that match a name or attribute?
>>>>>
>>>>>
>>>>>
>>>>> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com) wrote:
>>>>>
>>>>> Scott
>>>>>
>>>>> This idea has come up a couple of times and there is definitely
>>>>> something intriguing to it. Where I think this idea stalls out though
>>>>> is in implementation.
>>>>>
>>>>> While I agree that the other List* processors might similarly benefit
>>>>> lets focus on ListFile. Today you tell ListFile what directory to
>>>>> start looking for files in. It goes off scanning that directory for
>>>>> hits and stores state about what it has already searched/seen. And it
>>>>> is important to keep track of how much it has already scanned because
>>>>> at times the search directory can be massive (100,000s of thousands or
>>>>> more files and directories to scan for example).
>>>>>
>>>>> In the proposed model the directory to be scanned could be provided
>>>>> dynamically by looking at an attribute of an incoming flowfile (or
>>>>> other criteria can be provided - not just the directory to scan). In
>>>>> this case the ListFile processor goes on scanning against that now.
>>>>> What about the previous directory (or directories) it was told to
>>>>> scan? Does it still track those too? What if it starts scanning the
>>>>> newly provided directory, hasn't finished pulling all the data or new
>>>>> data is continually arriving, and it is told to switch to another
>>>>> directory.
>>>>>
>>>>> I think if those questions can get solid answers and someone invests
>>>>> time in creating a PR then this could be pretty powerful. Would be
>>>>> good to see a written description of the use case(s) for this too.
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com> wrote:
>>>>>> Hello Devs,
>>>>>>
>>>>>> I would like to request a feature to a major processor, ListSFTP. But
>>>>>> before I do down the official road, I wanted to ask if anyone thought it
>>>>>> was a terrible idea or impossible, etc. The request is to add support
>>>>>> for an incoming relationship to the ListSFTP processor specifically, but
>>>>>> I could see it added to many of the commonly used head processes, such
>>>>>> as ListFile. I would envision functionality more like InvokeHTTP or
>>>>>> ExecuteSQL, where an incoming flow file could initiate the action, and
>>>>>> the attributes in the incoming flow file could be used to configure the
>>>>>> processor actions. It's the configuration aspect that most appeals to
>>>>>> me, because it opens it up to being centrally or dynamically configured.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Scott
>>>>>>
>>>>
>>
>

Re: ListSFTP incoming relationship

Posted by scott <tc...@gmail.com>.

This is a great discussion, and I appreciate the interest in my problem. 
I think there are workarounds if you decide not to store state, but I'd 
recommend keeping it. I think state should be optional, and even turned 
off by default. Several times I've had issues where the state has caused 
me to miss files, because files get moved into the source folder out of 
order, and I've wished I could turn the state feature off.
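
A small sketch of how that can happen, assuming the state is nothing more
than the newest modification time seen so far (a simplification; the real
List* processors track more than this). The function name, file names and
timestamps are made up for illustration.

```python
def list_new(files, last_seen):
    """Return (files newer than last_seen, updated last_seen).

    `files` maps a file name to its modification time; this is a
    simplified, hypothetical model of timestamp-based listing state.
    """
    new = sorted(name for name, mtime in files.items() if mtime > last_seen)
    newest = max(files.values(), default=last_seen)
    return new, max(newest, last_seen)


# Run 1: one file is present, modified at t=200.
folder = {"report-2.csv": 200}
picked, state = list_new(folder, last_seen=0)

# A file with an older timestamp (t=100) is then moved into the folder.
folder["report-1.csv"] = 100
picked_later, state = list_new(folder, last_seen=state)
# picked_later is empty: report-1.csv is older than the stored timestamp,
# so a purely timestamp-based listing never reports it.
```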

In my current use-case, I would not be frequently, dynamically changing 
the source directory, though I can see use-cases where that would happen. 
In my current use-case, I want to use an external database table to 
control the configuration of all my flows. I do this by first reading 
the content of the table for this particular flow ID, then assigning the 
result as attributes to the flowfile, essentially creating variables I 
can use throughout the flow to control its behavior. This works great 
with flows that initiate with HTTP or SQL, but not ListSFTP or ListFile.

Scott


On 03/27/2018 02:05 PM, Andy LoPresto wrote:
> I think Bryan’s point is a good one and when I first saw this question 
> (and thought of the previous times it’s been asked), my initial 
> response is to propose a second processor.
>
> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates 
> differently from ListSFTP — it does not maintain state, and performs a 
> one-time tabulation/chronicling of the state of that directory at the 
> given point in time.
>
> The responsibility to maintain and compare state across time is no 
> longer a requirement. There could even be a setting in the processor 
> to allow for “individual flowfile output” (i.e. act the same as 
> ListSFTP and output one flowfile per item listed) or “summary flowfile 
> output” where a single flowfile is generated containing the directory 
> listing information for all the items there. (Another option is to 
> output both on two different relationships).
>
> I think this would enable the types of workflows that users have asked 
> about in the past without compromising the mechanism by which List* 
> processors work and adding undue complexity to those processors.
>
> Absolutely crystal clear documentation (and a standard verb for the 
> new processor family) would be necessary (not only because these 
> processor solve different problems, but to avoid a million variants of 
> “I used ScanSFTP processor and it’s not tracking state”/“How do I 
> provide a directory in an attribute to ListSFTP” mailing list questions).
>
>
> Andy LoPresto
> alopresto@apache.org
> alopresto.apache@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com> wrote:
>>
>> The key here is that ListXXX processor maintains state. A directory is part
>> of such state. Allowing arbitrary directories via an expression would
>> create never ending stream of new entries in the state storage, effectively
>> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
>> (for when state is stored in there).
>>
>> Maybe if we focus on thinking about assumptions and restrictions the
>> processor should make to contain that risk...
>>
>> Andrew
>>
>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com> wrote:
>>
>>> I'm not sure that would solve the problem because you'd still be
>>> limited to one directory. What most people are asking for is the
>>> ability to use a dynamic directory from an incoming flow file.
>>>
>>> I think we might be trying to fit two different use-cases into one
>>> processor which might not make sense.
>>>
>>> Scenario #1... There is a directory that is constantly receiving new
>>> data and has a significant amount of files, and I want to periodically
>>> find new files. This is what the current processors are optimized for.
>>>
>>> Scenario #2... There is a directory that is mostly static with a
>>> moderate/small number of files, and at points in my flow I want to
>>> dynamically perform a listing of this directory and retrieve the
>>> files. This is more geared towards the mentality of running a
>>> job/workflow.
>>>
>>>
>>>
>>>
>>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <ottobackwards@gmail.com>
>>> wrote:
>>>> What if the changes where ‘on top of’ some base set of properties, like
>>>> directory?
>>>> Like a filter, where if present from the incoming file will have the
>>>> LIST* list only things that match a name or attribute?
>>>>
>>>>
>>>>
>>>> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com) wrote:
>>>>
>>>> Scott
>>>>
>>>> This idea has come up a couple of times and there is definitely
>>>> something intriguing to it. Where I think this idea stalls out though
>>>> is in implementation.
>>>>
>>>> While I agree that the other List* processors might similarly benefit
>>>> lets focus on ListFile. Today you tell ListFile what directory to
>>>> start looking for files in. It goes off scanning that directory for
>>>> hits and stores state about what it has already searched/seen. And it
>>>> is important to keep track of how much it has already scanned because
>>>> at times the search directory can be massive (100,000s of thousands or
>>>> more files and directories to scan for example).
>>>>
>>>> In the proposed model the directory to be scanned could be provided
>>>> dynamically by looking at an attribute of an incoming flowfile (or
>>>> other criteria can be provided - not just the directory to scan). In
>>>> this case the ListFile processor goes on scanning against that now.
>>>> What about the previous directory (or directories) it was told to
>>>> scan? Does it still track those too? What if it starts scanning the
>>>> newly provided directory, hasn't finished pulling all the data or new
>>>> data is continually arriving, and it is told to switch to another
>>>> directory.
>>>>
>>>> I think if those questions can get solid answers and someone invests
>>>> time in creating a PR then this could be pretty powerful. Would be
>>>> good to see a written description of the use case(s) for this too.
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com> wrote:
>>>>> Hello Devs,
>>>>>
>>>>> I would like to request a feature to a major processor, ListSFTP. But
>>>>> before I do down the official road, I wanted to ask if anyone thought it
>>>>> was a terrible idea or impossible, etc. The request is to add support
>>>>> for an incoming relationship to the ListSFTP processor specifically, but
>>>>> I could see it added to many of the commonly used head processes, such
>>>>> as ListFile. I would envision functionality more like InvokeHTTP or
>>>>> ExecuteSQL, where an incoming flow file could initiate the action, and
>>>>> the attributes in the incoming flow file could be used to configure the
>>>>> processor actions. It's the configuration aspect that most appeals to
>>>>> me, because it opens it up to being centrally or dynamically configured.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Scott
>>>>>
>>>
>

Re: ListSFTP incoming relationship

Posted by Joe Witt <jo...@gmail.com>.

+1 to Bryan/AndyL recommendation.

"Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
differently from ListSFTP — it does not maintain state, and performs a
one-time tabulation/chronicling of the state of that directory at the
given point in time."

On Tue, Mar 27, 2018 at 5:05 PM, Andy LoPresto <al...@apache.org> wrote:
> I think Bryan’s point is a good one and when I first saw this question (and
> thought of the previous times it’s been asked), my initial response is to
> propose a second processor.
>
> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
> differently from ListSFTP — it does not maintain state, and performs a
> one-time tabulation/chronicling of the state of that directory at the given
> point in time.
>
> The responsibility to maintain and compare state across time is no longer a
> requirement. There could even be a setting in the processor to allow for
> “individual flowfile output” (i.e. act the same as ListSFTP and output one
> flowfile per item listed) or “summary flowfile output” where a single
> flowfile is generated containing the directory listing information for all
> the items there. (Another option is to output both on two different
> relationships).
>
> I think this would enable the types of workflows that users have asked about
> in the past without compromising the mechanism by which List* processors
> work and adding undue complexity to those processors.
>
> Absolutely crystal clear documentation (and a standard verb for the new
> processor family) would be necessary (not only because these processor solve
> different problems, but to avoid a million variants of “I used ScanSFTP
> processor and it’s not tracking state”/“How do I provide a directory in an
> attribute to ListSFTP” mailing list questions).
>
>
> Andy LoPresto
> alopresto@apache.org
> alopresto.apache@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Mar 27, 2018, at 8:33 AM, Andrew Grande <ap...@gmail.com> wrote:
>
> The key here is that ListXXX processor maintains state. A directory is part
> of such state. Allowing arbitrary directories via an expression would
> create never ending stream of new entries in the state storage, effectively
> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
> (for when state is stored in there).
>
> Maybe if we focus on thinking about assumptions and restrictions the
> processor should make to contain that risk...
>
> Andrew
>
> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bb...@gmail.com> wrote:
>
> I'm not sure that would solve the problem because you'd still be
> limited to one directory. What most people are asking for is the
> ability to use a dynamic directory from an incoming flow file.
>
> I think we might be trying to fit two different use-cases into one
> processor which might not make sense.
>
> Scenario #1... There is a directory that is constantly receiving new
> data and has a significant amount of files, and I want to periodically
> find new files. This is what the current processors are optimized for.
>
> Scenario #2... There is a directory that is mostly static with a
> moderate/small number of files, and at points in my flow I want to
> dynamically perform a listing of this directory and retrieve the
> files. This is more geared towards the mentality of running a
> job/workflow.
>
>
>
>
> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <ot...@gmail.com>
> wrote:
>
> What if the changes where ‘on top of’ some base set of properties, like
> directory?
> Like a filter, where if present from the incoming file will have the
> LIST* list only things that match a name or attribute?
>
>
>
> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com) wrote:
>
> Scott
>
> This idea has come up a couple of times and there is definitely
> something intriguing to it. Where I think this idea stalls out though
> is in implementation.
>
> While I agree that the other List* processors might similarly benefit
> lets focus on ListFile. Today you tell ListFile what directory to
> start looking for files in. It goes off scanning that directory for
> hits and stores state about what it has already searched/seen. And it
> is important to keep track of how much it has already scanned because
> at times the search directory can be massive (100,000s of thousands or
> more files and directories to scan for example).
>
> In the proposed model the directory to be scanned could be provided
> dynamically by looking at an attribute of an incoming flowfile (or
> other criteria can be provided - not just the directory to scan). In
> this case the ListFile processor goes on scanning against that now.
> What about the previous directory (or directories) it was told to
> scan? Does it still track those too? What if it starts scanning the
> newly provided directory, hasn't finished pulling all the data or new
> data is continually arriving, and it is told to switch to another
> directory.
>
> I think if those questions can get solid answers and someone invests
> time in creating a PR then this could be pretty powerful. Would be
> good to see a written description of the use case(s) for this too.
>
> Thanks
> Joe
>
> On Mon, Mar 26, 2018 at 11:58 PM, scott <tc...@gmail.com> wrote:
>
> Hello Devs,
>
> I would like to request a feature to a major processor, ListSFTP. But
> before I do down the official road, I wanted to ask if anyone thought it
> was a terrible idea or impossible, etc. The request is to add support
> for an incoming relationship to the ListSFTP processor specifically, but
> I could see it added to many of the commonly used head processes, such
> as ListFile. I would envision functionality more like InvokeHTTP or
> ExecuteSQL, where an incoming flow file could initiate the action, and
> the attributes in the incoming flow file could be used to configure the
> processor actions. It's the configuration aspect that most appeals to
> me, because it opens it up to being centrally or dynamically configured.
>
> Thanks,
>
> Scott
>
>
>

Re: ListSFTP incoming relationship

Posted by Andy LoPresto <al...@apache.org>.

I think Bryan’s point is a good one, and when I first saw this question (and thought of the previous times it’s been asked), my initial response was to propose a second processor.

Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates differently from ListSFTP — it does not maintain state, and performs a one-time tabulation/chronicling of the state of that directory at the given point in time.

The responsibility to maintain and compare state across time is no longer a requirement. There could even be a setting in the processor to allow for “individual flowfile output” (i.e. act the same as ListSFTP and output one flowfile per item listed) or “summary flowfile output” where a single flowfile is generated containing the directory listing information for all the items there. (Another option is to output both on two different relationships).
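
The two output modes could look roughly like the following. This is a plain-Python sketch under assumptions of my own, not NiFi code: a flowfile is modelled as a dict with `attributes` and `content`, and the function name, mode names and attribute keys are placeholders.

```python
import json
from typing import Dict, List


def build_output(directory: str, names: List[str], mode: str) -> List[Dict]:
    """Turn one directory listing into flowfile-like dicts.

    mode="individual": one flowfile per listed item (ListSFTP-style).
    mode="summary":    a single flowfile whose content is the whole listing.
    """
    if mode == "individual":
        return [
            {"attributes": {"path": directory, "filename": name}, "content": ""}
            for name in names
        ]
    if mode == "summary":
        return [{
            "attributes": {"path": directory, "file.count": str(len(names))},
            "content": json.dumps(names),
        }]
    raise ValueError(f"unknown output mode: {mode}")
```

Emitting both on separate relationships would amount to calling this once per mode and routing each result to its own relationship.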

I think this would enable the types of workflows that users have asked about in the past without compromising the mechanism by which List* processors work and adding undue complexity to those processors.

Absolutely crystal clear documentation (and a standard verb for the new processor family) would be necessary (not only because these processors solve different problems, but to avoid a million variants of “I used ScanSFTP processor and it’s not tracking state”/“How do I provide a directory in an attribute to ListSFTP” mailing list questions).


Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Mar 27, 2018, at 8:33 AM, Andrew Grande <ap...@gmail.com> wrote:
> 
> The key here is that ListXXX processor maintains state. A directory is part
> of such state. Allowing arbitrary directories via an expression would
> create never ending stream of new entries in the state storage, effectively
> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
> (for when state is stored in there).
> 
> Maybe if we focus on thinking about assumptions and restrictions the
> processor should make to contain that risk...
> 
> Andrew
> 
> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bb...@gmail.com> wrote:
> 
>> I'm not sure that would solve the problem because you'd still be
>> limited to one directory. What most people are asking for is the
>> ability to use a dynamic directory from an incoming flow file.
>> 
>> I think we might be trying to fit two different use-cases into one
>> processor which might not make sense.
>> 
>> Scenario #1... There is a directory that is constantly receiving new
>> data and has a significant amount of files, and I want to periodically
>> find new files. This is what the current processors are optimized for.
>> 
>> Scenario #2... There is a directory that is mostly static with a
>> moderate/small number of files, and at points in my flow I want to
>> dynamically perform a listing of this directory and retrieve the
>> files. This is more geared towards the mentality of running a
>> job/workflow.
>> 
>> 
>> 
>> 
>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <ot...@gmail.com>
>> wrote:
>>> What if the changes where ‘on top of’ some base set of properties, like
>>> directory?
>>> Like a filter, where if present from the incoming file will have the
>>> LIST* list only things that match a name or attribute?
>>> 
>>> 
>>> 
>>> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com) wrote:
>>> 
>>> Scott
>>> 
>>> This idea has come up a couple of times and there is definitely
>>> something intriguing to it. Where I think this idea stalls out though
>>> is in implementation.
>>> 
>>> While I agree that the other List* processors might similarly benefit
>>> lets focus on ListFile. Today you tell ListFile what directory to
>>> start looking for files in. It goes off scanning that directory for
>>> hits and stores state about what it has already searched/seen. And it
>>> is important to keep track of how much it has already scanned because
>>> at times the search directory can be massive (100,000s of thousands or
>>> more files and directories to scan for example).
>>> 
>>> In the proposed model the directory to be scanned could be provided
>>> dynamically by looking at an attribute of an incoming flowfile (or
>>> other criteria can be provided - not just the directory to scan). In
>>> this case the ListFile processor goes on scanning against that now.
>>> What about the previous directory (or directories) it was told to
>>> scan? Does it still track those too? What if it starts scanning the
>>> newly provided directory, hasn't finished pulling all the data or new
>>> data is continually arriving, and it is told to switch to another
>>> directory.
>>> 
>>> I think if those questions can get solid answers and someone invests
>>> time in creating a PR then this could be pretty powerful. Would be
>>> good to see a written description of the use case(s) for this too.
>>> 
>>> Thanks
>>> Joe
>>> 
>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tc...@gmail.com> wrote:
>>>> Hello Devs,
>>>> 
>>>> I would like to request a feature to a major processor, ListSFTP. But
>>>> before I do down the official road, I wanted to ask if anyone thought it
>>>> was a terrible idea or impossible, etc. The request is to add support
>>>> for an incoming relationship to the ListSFTP processor specifically, but
>>>> I could see it added to many of the commonly used head processes, such
>>>> as ListFile. I would envision functionality more like InvokeHTTP or
>>>> ExecuteSQL, where an incoming flow file could initiate the action, and
>>>> the attributes in the incoming flow file could be used to configure the
>>>> processor actions. It's the configuration aspect that most appeals to
>>>> me, because it opens it up to being centrally or dynamically configured.
>>>> 
>>>> Thanks,
>>>> 
>>>> Scott
>>>> 
>>

Re: ListSFTP incoming relationship

Posted by Andrew Grande <ap...@gmail.com>.

The key here is that a ListXXX processor maintains state. A directory is part
of that state. Allowing arbitrary directories via an expression would
create a never-ending stream of new entries in the state storage, effectively
engineering a distributed DoS attack on the NiFi node or shared ZK quorum
(when state is stored there).

Maybe if we focus on thinking about assumptions and restrictions the
processor should make to contain that risk...
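
One such restriction, sketched very roughly: cap the number of directories
for which state is kept and evict the least recently updated one. This is
plain Python with made-up names (BoundedListingState, max_directories),
not a description of how NiFi's state storage works.

```python
from collections import OrderedDict


class BoundedListingState:
    """Keeps a last-seen timestamp per directory, for at most
    `max_directories` directories (a hypothetical restriction)."""

    def __init__(self, max_directories: int):
        self.max_directories = max_directories
        self._state = OrderedDict()  # directory -> last-seen timestamp

    def update(self, directory: str, last_seen: int) -> None:
        self._state[directory] = last_seen
        self._state.move_to_end(directory)       # mark as most recently updated
        while len(self._state) > self.max_directories:
            self._state.popitem(last=False)      # evict the oldest entry

    def get(self, directory: str, default: int = 0) -> int:
        return self._state.get(directory, default)

    def __len__(self) -> int:
        return len(self._state)
```

The trade-off is that an evicted directory would be listed from scratch
the next time it shows up.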

Andrew

On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bb...@gmail.com> wrote:

> I'm not sure that would solve the problem because you'd still be
> limited to one directory. What most people are asking for is the
> ability to use a dynamic directory from an incoming flow file.
>
> I think we might be trying to fit two different use-cases into one
> processor which might not make sense.
>
> Scenario #1... There is a directory that is constantly receiving new
> data and has a significant amount of files, and I want to periodically
> find new files. This is what the current processors are optimized for.
>
> Scenario #2... There is a directory that is mostly static with a
> moderate/small number of files, and at points in my flow I want to
> dynamically perform a listing of this directory and retrieve the
> files. This is more geared towards the mentality of running a
> job/workflow.
>
>
>
>
> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <ot...@gmail.com>
> wrote:
> > What if the changes where ‘on top of’ some base set of properties, like
> > directory?
> > Like a filter, where if present from the incoming file will have the
> > LIST* list only things that match a name or attribute?
> >
> >
> >
> > On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com) wrote:
> >
> > Scott
> >
> > This idea has come up a couple of times and there is definitely
> > something intriguing to it. Where I think this idea stalls out though
> > is in implementation.
> >
> > While I agree that the other List* processors might similarly benefit
> > lets focus on ListFile. Today you tell ListFile what directory to
> > start looking for files in. It goes off scanning that directory for
> > hits and stores state about what it has already searched/seen. And it
> > is important to keep track of how much it has already scanned because
> > at times the search directory can be massive (100,000s of thousands or
> > more files and directories to scan for example).
> >
> > In the proposed model the directory to be scanned could be provided
> > dynamically by looking at an attribute of an incoming flowfile (or
> > other criteria can be provided - not just the directory to scan). In
> > this case the ListFile processor goes on scanning against that now.
> > What about the previous directory (or directories) it was told to
> > scan? Does it still track those too? What if it starts scanning the
> > newly provided directory, hasn't finished pulling all the data or new
> > data is continually arriving, and it is told to switch to another
> > directory.
> >
> > I think if those questions can get solid answers and someone invests
> > time in creating a PR then this could be pretty powerful. Would be
> > good to see a written description of the use case(s) for this too.
> >
> > Thanks
> > Joe
> >
> > On Mon, Mar 26, 2018 at 11:58 PM, scott <tc...@gmail.com> wrote:
> >> Hello Devs,
> >>
> >> I would like to request a feature to a major processor, ListSFTP. But
> >> before I do down the official road, I wanted to ask if anyone thought it
> >> was a terrible idea or impossible, etc. The request is to add support
> >> for an incoming relationship to the ListSFTP processor specifically, but
> >> I could see it added to many of the commonly used head processes, such
> >> as ListFile. I would envision functionality more like InvokeHTTP or
> >> ExecuteSQL, where an incoming flow file could initiate the action, and
> >> the attributes in the incoming flow file could be used to configure the
> >> processor actions. It's the configuration aspect that most appeals to
> >> me, because it opens it up to being centrally or dynamically configured.
> >>
> >> Thanks,
> >>
> >> Scott
> >>
>

Re: ListSFTP incoming relationship

Posted by Bryan Bende <bb...@gmail.com>.

I'm not sure that would solve the problem because you'd still be
limited to one directory. What most people are asking for is the
ability to use a dynamic directory from an incoming flow file.

I think we might be trying to fit two different use-cases into one
processor, which might not make sense.

Scenario #1... There is a directory that is constantly receiving new
data and has a significant number of files, and I want to periodically
find new files. This is what the current processors are optimized for.

Scenario #2... There is a directory that is mostly static with a
moderate/small number of files, and at points in my flow I want to
dynamically perform a listing of this directory and retrieve the
files. This is more geared towards the mentality of running a
job/workflow.
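
To make Scenario #2 concrete, here is a rough sketch in Python (not
NiFi's Java processor API). It assumes the directory arrives as a flow
file attribute; the attribute name `listing.directory` and the function
name are made up for illustration. The point is that no state is kept,
so every trigger returns the full listing:

```python
import os


def list_for_flowfile(attributes):
    """Stateless listing: every call returns the directory's full contents."""
    # 'listing.directory' is a hypothetical attribute name, not a NiFi one.
    directory = attributes["listing.directory"]
    with os.scandir(directory) as entries:
        # Only regular files are reported; subdirectories are skipped.
        return sorted(entry.name for entry in entries if entry.is_file())
```

Because nothing is remembered between calls, this only makes sense for
the small, mostly static directories described in Scenario #2.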




On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <ot...@gmail.com> wrote:
> What if the changes were ‘on top of’ some base set of properties, like
> directory?
> Like a filter, where if present from the incoming file will have the LIST*
> list only things
> that match a name or attribute?
>
>
>
> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com) wrote:
>
> Scott
>
> This idea has come up a couple of times and there is definitely
> something intriguing to it. Where I think this idea stalls out though
> is in implementation.
>
> While I agree that the other List* processors might similarly benefit
> let's focus on ListFile. Today you tell ListFile what directory to
> start looking for files in. It goes off scanning that directory for
> hits and stores state about what it has already searched/seen. And it
> is important to keep track of how much it has already scanned because
> at times the search directory can be massive (hundreds of thousands or
> more files and directories to scan for example).
>
> In the proposed model the directory to be scanned could be provided
> dynamically by looking at an attribute of an incoming flowfile (or
> other criteria can be provided - not just the directory to scan). In
> this case the ListFile processor goes on scanning against that now.
> What about the previous directory (or directories) it was told to
> scan? Does it still track those too? What if it starts scanning the
> newly provided directory, hasn't finished pulling all the data or new
> data is continually arriving, and it is told to switch to another
> directory?
>
> I think if those questions can get solid answers and someone invests
> time in creating a PR then this could be pretty powerful. Would be
> good to see a written description of the use case(s) for this too.
>
> Thanks
> Joe
>
> On Mon, Mar 26, 2018 at 11:58 PM, scott <tc...@gmail.com> wrote:
>> Hello Devs,
>>
>> I would like to request a feature to a major processor, ListSFTP. But
> before
>> I go down the official road, I wanted to ask if anyone thought it was a
>> terrible idea or impossible, etc. The request is to add support for an
>> incoming relationship to the ListSFTP processor specifically, but I could
>> see it added to many of the commonly used head processes, such as
> ListFile.
>> I would envision functionality more like InvokeHTTP or ExecuteSQL, where
> an
>> incoming flow file could initiate the action, and the attributes in the
>> incoming flow file could be used to configure the processor actions. It's
>> the configuration aspect that most appeals to me, because it opens it up
> to
>> being centrally or dynamically configured.
>>
>> Thanks,
>>
>> Scott
>>

Re: ListSFTP incoming relationship

Posted by Otto Fowler <ot...@gmail.com>.

What if the changes were ‘on top of’ some base set of properties, like
directory? Like a filter: if one is present on the incoming flow file, the
LIST* processor lists only the things that match a name or attribute?
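
A rough sketch of what I mean, in Python rather than NiFi's Java API.
The base directory stays a fixed property and the incoming flow file can
only narrow the result. The attribute name `listing.filter` and the
function name are invented for this example, and a glob pattern is just
one assumption about what the filter could look like:

```python
import fnmatch
import os


def list_with_optional_filter(base_directory, attributes):
    """List files in the fixed base directory, narrowed by an optional filter."""
    # 'listing.filter' is a hypothetical attribute; without it, everything matches.
    pattern = attributes.get("listing.filter", "*")
    with os.scandir(base_directory) as entries:
        names = [entry.name for entry in entries if entry.is_file()]
    return sorted(name for name in names if fnmatch.fnmatch(name, pattern))
```

With no filter attribute the behavior is the same as a plain listing of
the base directory.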

On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com) wrote:

Scott

This idea has come up a couple of times and there is definitely
something intriguing to it. Where I think this idea stalls out though
is in implementation.

While I agree that the other List* processors might similarly benefit
let's focus on ListFile. Today you tell ListFile what directory to
start looking for files in. It goes off scanning that directory for
hits and stores state about what it has already searched/seen. And it
is important to keep track of how much it has already scanned because
at times the search directory can be massive (hundreds of thousands or
more files and directories to scan for example).

In the proposed model the directory to be scanned could be provided
dynamically by looking at an attribute of an incoming flowfile (or
other criteria can be provided - not just the directory to scan). In
this case the ListFile processor goes on scanning against that now.
What about the previous directory (or directories) it was told to
scan? Does it still track those too? What if it starts scanning the
newly provided directory, hasn't finished pulling all the data or new
data is continually arriving, and it is told to switch to another
directory?

I think if those questions can get solid answers and someone invests
time in creating a PR then this could be pretty powerful. Would be
good to see a written description of the use case(s) for this too.

Thanks
Joe

On Mon, Mar 26, 2018 at 11:58 PM, scott <tc...@gmail.com> wrote:
> Hello Devs,
>
> I would like to request a feature to a major processor, ListSFTP. But
before
> I go down the official road, I wanted to ask if anyone thought it was a
> terrible idea or impossible, etc. The request is to add support for an
> incoming relationship to the ListSFTP processor specifically, but I could
> see it added to many of the commonly used head processes, such as
ListFile.
> I would envision functionality more like InvokeHTTP or ExecuteSQL, where
an
> incoming flow file could initiate the action, and the attributes in the
> incoming flow file could be used to configure the processor actions. It's
> the configuration aspect that most appeals to me, because it opens it up
to
> being centrally or dynamically configured.
>
> Thanks,
>
> Scott
>

Re: ListSFTP incoming relationship

Posted by Joe Witt <jo...@gmail.com>.

Scott

This idea has come up a couple of times and there is definitely
something intriguing to it.  Where I think this idea stalls out though
is in implementation.

While I agree that the other List* processors might similarly benefit
let's focus on ListFile.  Today you tell ListFile what directory to
start looking for files in.  It goes off scanning that directory for
hits and stores state about what it has already searched/seen.  And it
is important to keep track of how much it has already scanned because
at times the search directory can be massive (hundreds of thousands or
more files and directories to scan for example).

In the proposed model the directory to be scanned could be provided
dynamically by looking at an attribute of an incoming flowfile (or
other criteria can be provided - not just the directory to scan).  In
this case the ListFile processor goes on scanning against that now.
What about the previous directory (or directories) it was told to
scan?  Does it still track those too?  What if it starts scanning the
newly provided directory, hasn't finished pulling all the data or new
data is continually arriving, and it is told to switch to another
directory?
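
One possible answer, sketched in Python rather than against NiFi's
actual Java API: keep a separate "already seen" marker per directory, so
switching directories neither discards nor mixes progress.  The names
`DirectoryListingState` and `list_new` are hypothetical, and using the
newest modification time as the marker is an assumption made for this
example, not a description of how ListFile stores state today.

```python
import os


class DirectoryListingState:
    """Hypothetical tracker: one 'newest mtime already listed' per directory."""

    def __init__(self):
        self._latest_seen = {}  # directory path -> newest mtime already listed

    def list_new(self, directory):
        """Return files in `directory` modified after the last listing of it."""
        threshold = self._latest_seen.get(directory, float("-inf"))
        newest = threshold
        fresh = []
        with os.scandir(directory) as entries:
            for entry in entries:
                if not entry.is_file():
                    continue
                mtime = entry.stat().st_mtime
                if mtime > threshold:
                    fresh.append(entry.name)
                    newest = max(newest, mtime)
        # Progress is recorded under this directory's key only.
        self._latest_seen[directory] = newest
        return sorted(fresh)
```

This does not settle the harder parts: the map grows with every
directory ever requested, and a file that arrives later with the same
timestamp as the stored marker would be missed.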

I think if those questions can get solid answers and someone invests
time in creating a PR then this could be pretty powerful.  Would be
good to see a written description of the use case(s) for this too.

Thanks
Joe

On Mon, Mar 26, 2018 at 11:58 PM, scott <tc...@gmail.com> wrote:
> Hello Devs,
>
> I would like to request a feature to a major processor, ListSFTP. But before
> I go down the official road, I wanted to ask if anyone thought it was a
> terrible idea or impossible, etc. The request is to add support for an
> incoming relationship to the ListSFTP processor specifically, but I could
> see it added to many of the commonly used head processes, such as ListFile.
> I would envision functionality more like InvokeHTTP or ExecuteSQL, where an
> incoming flow file could initiate the action, and the attributes in the
> incoming flow file could be used to configure the processor actions. It's
> the configuration aspect that most appeals to me, because it opens it up to
> being centrally or dynamically configured.
>
> Thanks,
>
> Scott
>