Posted to dev@nifi.apache.org by Ryan Ward <ry...@gmail.com> on 2017/10/31 15:38:42 UTC

FetchSFTP vs GetSFTP

I've found that on a single node GetSFTP is able to pull more files off a
remote server than FetchSFTP in a cluster. I noticed FetchSFTP doesn't have a
max-selects setting, so it requires far more connections (one per file?) and
concurrent threads to keep up.

Was wondering if anyone is using List/Fetch at scale, in the multi-TB-a-day
range?

Thanks,
Ryan

Re: FetchSFTP vs GetSFTP

Posted by Bryan Bende <bb...@gmail.com>.

The list-fetch approach sounds correct, and the micro acquisition
cluster (if necessary) also sounds like a good idea.

Regarding multiple hosts, the connection pooling in FetchSFTP does
account for that. It's basically a map from the hostname string to a
holder of connections for that hostname.
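
As a rough illustration of that idea (not NiFi's actual code), the sketch
below keeps one holder of idle connections per hostname, so flow files that
point at different servers never share or evict each other's connections.
FakeConnection, HostKeyedPool, and the host names are hypothetical stand-ins
invented for this example.

```python
from collections import defaultdict, deque


class FakeConnection:
    """Hypothetical stand-in for an SFTP connection to one host."""

    def __init__(self, host):
        self.host = host


class HostKeyedPool:
    """Maps each hostname to a holder (a deque) of idle connections."""

    def __init__(self):
        self._idle = defaultdict(deque)
        self.opened = 0  # counts how many new connections were created

    def acquire(self, host):
        holder = self._idle[host]
        if holder:
            return holder.popleft()  # reuse an idle connection for this host
        self.opened += 1
        return FakeConnection(host)

    def release(self, conn):
        # Return the connection to the holder for its own host only.
        self._idle[conn.host].append(conn)


pool = HostKeyedPool()
for host in ["sftp-a", "sftp-b", "sftp-a", "sftp-b", "sftp-a"]:
    conn = pool.acquire(host)
    # ... the file would be transferred here ...
    pool.release(conn)

print(pool.opened)  # one connection per distinct host in this run
```

With five sequential requests across two hosts, only two connections are
opened in this toy model; with 60 hosts the map would simply hold 60 holders.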

-Bryan

On Tue, Oct 31, 2017 at 7:55 PM, Ryan Ward <ry...@gmail.com> wrote:
> Yep, that's exactly how I have it set up, with a push to an RPG. Is that
> preferred? I just started playing with it, to be honest. I can see how it
> could be tricky if you have to pull from multiple servers, since each flow
> file in the queues could potentially have a different SFTP host address.
>
> All together we have to pull from about 60 servers. If this doesn't work
> out with list/fetch, I plan to have a micro acquisition cluster just
> for gets.
>
> Ryan

Re: FetchSFTP vs GetSFTP

Posted by Ryan Ward <ry...@gmail.com>.

Yep, that's exactly how I have it set up, with a push to an RPG. Is that
preferred? I just started playing with it, to be honest. I can see how it
could be tricky if you have to pull from multiple servers, since each flow
file in the queues could potentially have a different SFTP host address.

All together we have to pull from about 60 servers. If this doesn't work
out with list/fetch, I plan to have a micro acquisition cluster just
for gets.

Ryan

On Oct 31, 2017 4:26 PM, "Bryan Bende" <bb...@gmail.com> wrote:

> Ryan,
>
> The 10 seconds appears to be a hard-coded rule in the processor,
> although it seems like it could be turned into a configurable
> property.
>
> It would require a code change to make it grab a batch of flow files
> during a single execution. In theory it shouldn't provide that much of
> a difference, but might be an interesting experiment. It makes the
> code more challenging to write though, not that that's a reason not to
> do it.
>
> If you have a 5 node cluster, you are doing List on primary node and
> then redistributing the results to all the nodes via an RPG so all
> nodes can fetch?
>
> -Bryan

Re: FetchSFTP vs GetSFTP

Posted by Bryan Bende <bb...@gmail.com>.

Ryan,

The 10 seconds appears to be a hard-coded rule in the processor,
although it seems like it could be turned into a configurable
property.

It would require a code change to make it grab a batch of flow files
during a single execution. In theory it shouldn't provide that much of
a difference, but might be an interesting experiment. It makes the
code more challenging to write though, not that that's a reason not to
do it.
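
To give a feel for what such a batching change might involve, here is a
minimal sketch, assuming each queued flow file carries a host and a path.
It is not how FetchSFTP works today; batch_by_host, the "host" and "path"
keys, and the sample values are all hypothetical names made up for this
example. Part of the extra complexity shows up right away: a batch has to be
grouped by host so that one connection can serve every file in it.

```python
from itertools import groupby
from operator import itemgetter


def batch_by_host(flow_files, max_batch):
    """Group flow files by host, then split each group into batches."""
    batches = []
    # sorted() is stable, so files for one host keep their queue order.
    ordered = sorted(flow_files, key=itemgetter("host"))
    for host, group in groupby(ordered, key=itemgetter("host")):
        paths = [f["path"] for f in group]
        for start in range(0, len(paths), max_batch):
            batches.append((host, paths[start:start + max_batch]))
    return batches


# Hypothetical queue contents; real flow files would carry attributes
# written by the listing processor.
queued = [
    {"host": "sftp-a", "path": "/in/1.dat"},
    {"host": "sftp-b", "path": "/in/2.dat"},
    {"host": "sftp-a", "path": "/in/3.dat"},
    {"host": "sftp-a", "path": "/in/4.dat"},
    {"host": "sftp-b", "path": "/in/5.dat"},
]

batches = batch_by_host(queued, max_batch=2)
for host, paths in batches:
    print(host, paths)
```

Each resulting batch could then be fetched over a single connection, at the
cost of handling partial failures within a batch.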

If you have a 5 node cluster, you are doing List on primary node and
then redistributing the results to all the nodes via an RPG so all
nodes can fetch?

-Bryan


On Tue, Oct 31, 2017 at 3:43 PM, Ryan Ward <ry...@gmail.com> wrote:
> Joe/Bryan, thanks!
>
> I believe the one specific file per concurrent task/connection (and too
> many threads) is the issue I have. We have a lot of small files and are
> often backed up. I'm going to drop the task count to take advantage of the
> pooling. Is it possible to have Fetch do batches vs. a single file? Would
> that improve throughput? Also, is that 10 seconds configurable?
>
> Some background: I'm converting 2 single nodes into a 5 node cluster and
> trying to figure out the best approach.
>
> Thanks again!

Re: FetchSFTP vs GetSFTP

Posted by Ryan Ward <ry...@gmail.com>.

Joe/Bryan, thanks!

I believe the one specific file per concurrent task/connection (and too
many threads) is the issue I have. We have a lot of small files and are
often backed up. I'm going to drop the task count to take advantage of the
pooling. Is it possible to have Fetch do batches vs. a single file? Would
that improve throughput? Also, is that 10 seconds configurable?

Some background: I'm converting 2 single nodes into a 5 node cluster and
trying to figure out the best approach.

Thanks again!



On Tue, Oct 31, 2017 at 2:56 PM, Bryan Bende <bb...@gmail.com> wrote:

> Ryan,
>
> Personally I don't have experience running these processors at scale,
> but from a code perspective they are fundamentally different...
>
> GetSFTP is a source processor, meaning it is not being fed by an upstream
> connection, so when it executes it can create a connection and
> retrieve up to max-selects during that one execution.
>
> FetchSFTP is being told to fetch one specific file, typically through
> attributes on incoming flow files, so the concept of max-selects
> doesn't really apply because there is only one thing to select during an
> execution of the processor.
>
> FetchSFTP does employ connection pooling behind the scenes such that
> it will keep open a connection for each concurrent task, as long as
> each connection continues to be used within 10 seconds.
>
> -Bryan

Re: FetchSFTP vs GetSFTP

Posted by Bryan Bende <bb...@gmail.com>.

Ryan,

Personally I don't have experience running these processors at scale,
but from a code perspective they are fundamentally different...

GetSFTP is a source processor, meaning it is not being fed by an upstream
connection, so when it executes it can create a connection and
retrieve up to max-selects during that one execution.

FetchSFTP is being told to fetch one specific file, typically through
attributes on incoming flow files, so the concept of max-selects
doesn't really apply because there is only one thing to select during an
execution of the processor.
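
A back-of-the-envelope sketch of that difference, counting processor
executions only: the figure of 100 files per execution for the GetSFTP-style
case is an assumed value for illustration, and executions_needed is a
hypothetical helper, not part of NiFi.

```python
import math


def executions_needed(file_count, files_per_execution):
    """Number of processor executions required to retrieve every file."""
    return math.ceil(file_count / files_per_execution)


files = 1000

# GetSFTP-style: one execution may retrieve many files (assumed 100 here).
get_style = executions_needed(files, files_per_execution=100)

# FetchSFTP-style: one incoming flow file, hence one file, per execution.
fetch_style = executions_needed(files, files_per_execution=1)

print(get_style, fetch_style)
```

The count of executions is not the same as the count of connections, since
a pooled connection can be reused across executions, but it shows why the
fetch approach leans on concurrent tasks to keep up with many small files.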

FetchSFTP does employ connection pooling behind the scenes such that
it will keep open a connection for each concurrent task, as long as
each connection continues to be used within 10 seconds.
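
The idle rule can be sketched as follows. This is a simplified model with
one task and one host, not the processor's real implementation: the class
and function names are hypothetical, time is passed in as plain numbers so
the behaviour is easy to follow, and the only value taken from the thread is
the 10-second limit.

```python
IDLE_LIMIT_SECONDS = 10  # the figure quoted in the thread


class PooledConnection:
    """Hypothetical connection that remembers when it was last used."""

    def __init__(self, host, now):
        self.host = host
        self.last_used = now
        self.closed = False


class IdleExpiringPool:
    def __init__(self):
        self._idle = {}  # host -> list of idle connections
        self.opened = 0

    def acquire(self, host, now):
        holder = self._idle.setdefault(host, [])
        while holder:
            conn = holder.pop()
            if now - conn.last_used <= IDLE_LIMIT_SECONDS:
                return conn  # used recently enough, so reuse it
            conn.closed = True  # idle for too long, so discard it
        self.opened += 1
        return PooledConnection(host, now)

    def release(self, conn, now):
        conn.last_used = now
        self._idle.setdefault(conn.host, []).append(conn)


def connections_opened(arrival_times):
    """Fetch one file at each arrival time and count new connections."""
    pool = IdleExpiringPool()
    for t in arrival_times:
        conn = pool.acquire("sftp-a", t)
        pool.release(conn, t)
    return pool.opened


steady = connections_opened([0, 3, 6, 9, 12])  # gaps shorter than the limit
bursty = connections_opened([0, 15, 30, 45])   # gaps longer than the limit
print(steady, bursty)
```

In this model a steady stream of files keeps reusing a single connection,
while files that arrive more than 10 seconds apart each pay for a new one.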

-Bryan

On Tue, Oct 31, 2017 at 11:43 AM, Joe Witt <jo...@gmail.com> wrote:
> Ryan - I don't know the code specifics behind FetchSFTP off-hand, but I
> can confirm there are users at that range for it.
>
> Thanks

Re: FetchSFTP vs GetSFTP

Posted by Joe Witt <jo...@gmail.com>.

Ryan - I don't know the code specifics behind FetchSFTP off-hand, but I
can confirm there are users at that range for it.

Thanks

On Tue, Oct 31, 2017 at 11:38 AM, Ryan Ward <ry...@gmail.com> wrote:
> I've found that on a single node GetSFTP is able to pull more files off a
> remote server than FetchSFTP in a cluster. I noticed FetchSFTP doesn't have a
> max-selects setting, so it requires far more connections (one per file?) and
> concurrent threads to keep up.
>
> Was wondering if anyone is using List/Fetch at scale, in the multi-TB-a-day
> range?
>
> Thanks,
> Ryan