Posted to dev@nifi.apache.org by Bryan Bende <bb...@gmail.com> on 2017/11/01 14:34:51 UTC

Re: FetchSFTP vs GetSFTP

The list-fetch approach sounds correct, and the micro acquisition
cluster (if necessary) also sounds like a good idea.

Regarding multiple hosts, the connection pooling in FetchSFTP does
account for that. It's basically a map from the hostname string to a
holder of connections for that hostname.
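
As a rough illustration of that idea (a minimal sketch only, not the actual
FetchSFTP code), the pool can be pictured as a map keyed by hostname whose
values hold idle connections, with anything left unused past an idle limit
closed instead of reused. The class name HostKeyedPool, its methods, and the
explicit clock argument are all hypothetical, chosen just to keep the example
self-contained:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: these names are illustrative, not NiFi classes.
final class HostKeyedPool<C extends AutoCloseable> {

    // An idle connection plus the time it was last handed back.
    private static final class Idle<T> {
        final T conn;
        final long lastUsedMillis;

        Idle(T conn, long lastUsedMillis) {
            this.conn = conn;
            this.lastUsedMillis = lastUsedMillis;
        }
    }

    // hostname -> holder of idle connections for that hostname
    private final Map<String, Deque<Idle<C>>> idleByHost = new HashMap<>();
    private final Function<String, C> connector; // opens a new connection to a host
    private final long maxIdleMillis;            // e.g. 10_000 for a 10 second limit

    HostKeyedPool(Function<String, C> connector, long maxIdleMillis) {
        this.connector = connector;
        this.maxIdleMillis = maxIdleMillis;
    }

    // Reuse a recently used connection for this host, or open a new one.
    synchronized C acquire(String host, long nowMillis) {
        Deque<Idle<C>> idle = idleByHost.get(host);
        while (idle != null && !idle.isEmpty()) {
            Idle<C> candidate = idle.pop();
            if (nowMillis - candidate.lastUsedMillis <= maxIdleMillis) {
                return candidate.conn;
            }
            closeQuietly(candidate.conn); // idle too long, so discard it
        }
        return connector.apply(host);
    }

    // Hand a connection back so a later fetch from the same host can reuse it.
    synchronized void release(String host, C conn, long nowMillis) {
        idleByHost.computeIfAbsent(host, h -> new ArrayDeque<>())
                  .push(new Idle<>(conn, nowMillis));
    }

    private static void closeQuietly(AutoCloseable c) {
        try {
            c.close();
        } catch (Exception ignored) {
            // best effort only
        }
    }
}
```

With a layout like this, flow files that name different SFTP hosts never
share a connection, while back-to-back fetches from the same host can.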

-Bryan

On Tue, Oct 31, 2017 at 7:55 PM, Ryan Ward <ry...@gmail.com> wrote:
> Yep, that's exactly how I have it set up, with a push to an RPG. Is that
> preferred? I just started playing with it, to be honest. I can see how it
> could be tricky if you have to pull from multiple servers, since each flow
> file in the queues could potentially have a different SFTP host address.
>
> All together we have to pull from about 60 servers. If this doesn't work
> out with the list/fetch, I plan to have a micro acquisition cluster just
> for gets.
>
> Ryan
>
> On Oct 31, 2017 4:26 PM, "Bryan Bende" <bb...@gmail.com> wrote:
>
>> Ryan,
>>
>> The 10 seconds appears to be a hard-coded rule in the processor,
>> although it seems like it could be turned into a configurable
>> property.
>>
>> It would require a code change to make it grab a batch of flow files
>> during a single execution. In theory it shouldn't provide that much of
>> a difference, but might be an interesting experiment. It makes the
>> code more challenging to write though, not that that's a reason not to
>> do it.
>>
>> If you have a 5 node cluster, you are doing List on the primary node and
>> then redistributing the results to all the nodes via an RPG so all
>> nodes can fetch?
>>
>> -Bryan
>>
>>
>> On Tue, Oct 31, 2017 at 3:43 PM, Ryan Ward <ry...@gmail.com> wrote:
>> > Joe/Bryan Thanks!
>> >
>> > I believe the one specific file per concurrent task/connection (and too
>> > many threads) is the issue I have. We have a lot of small files and are
>> > often backed up. I'm going to drop the task count to take advantage of
>> > the pooling. Is it possible to have Fetch do batches vs. a single file?
>> > Would that improve throughput? Also, is that 10 seconds configurable?
>> >
>> > Some background: I'm converting 2 single nodes into a 5 node cluster and
>> > trying to figure out the best approach.
>> >
>> > Thanks again!
>> >
>> >
>> >
>> > On Tue, Oct 31, 2017 at 2:56 PM, Bryan Bende <bb...@gmail.com> wrote:
>> >
>> >> Ryan,
>> >>
>> >> Personally I don't have experience running these processors at scale,
>> >> but from a code perspective they are fundamentally different...
>> >>
>> >> GetSFTP is a source processor, meaning it is not being fed by an upstream
>> >> connection, so when it executes it can create a connection and
>> >> retrieve up to max-selects during that one execution.
>> >>
>> >> FetchSFTP is being told to fetch one specific file, typically through
>> >> attributes on incoming flow files, so the concept of max-selects
>> >> doesn't really apply because there is only one thing to select during an
>> >> execution of the processor.
>> >>
>> >> FetchSFTP does employ connection pooling behind the scenes such that
>> >> it will keep open a connection for each concurrent task, as long as
>> >> each connection continues to be used within 10 seconds.
>> >>
>> >> -Bryan
>> >>
>> >>
>> >> On Tue, Oct 31, 2017 at 11:43 AM, Joe Witt <jo...@gmail.com> wrote:
>> >> > Ryan - don't know the code specifics behind FetchSFTP off-hand but I
>> >> > can confirm there are users at that range for it.
>> >> >
>> >> > Thanks
>> >> >
>> >> > On Tue, Oct 31, 2017 at 11:38 AM, Ryan Ward <ry...@gmail.com> wrote:
>> >> >> I've found that on a single node GetSFTP is able to pull more files
>> >> >> off a remote server than Fetch in a cluster. I noticed Fetch doesn't
>> >> >> have a max selects, so it is requiring way more connections (one per
>> >> >> file?) and concurrent threads to keep up.
>> >> >>
>> >> >> Was wondering if anyone is using List/Fetch at scale? In the
>> >> >> multi-TB-a-day range?
>> >> >>
>> >> >> Thanks,
>> >> >> Ryan
>> >>
>>