Posted to dev@nifi.apache.org by Joe Gresock <jg...@gmail.com> on 2016/10/28 11:30:09 UTC

GetSFTP backpressure question

I have a NiFilosophical question that came up when I had a GetSFTP
processor running to a back-pressured connection.

My GetSFTP is configured with max selects = 100, and the files in the
remote directory are nearly 1GB each.  The queue has a backpressure of 2GB,
and I assumed each run of GetSFTP would stop feeding files once it hit
backpressure.

I was initially puzzled when I started periodically seeing huge backlogs
(71GB) on each worker in the cluster in this particular queue, until I
looked at the queued count/bytes stats (very useful tool, btw):

Queued bytes statistics <https://imagebin.ca/v/301KDHEa1lCk>
Queued count statistics <https://imagebin.ca/v/301JqnUcGXLF>

Now it's evident that GetSFTP continues to emit files until it hits the max
selects, regardless of backpressure.  I think I understand why backpressure
couldn't necessarily trump this behavior (e.g., what if a processor needed
to emit a query result set in batches?  What would you do with the flow
files it wanted to emit if you suddenly hit backpressure?)

So my questions are:
- Do you think it's the user's responsibility to be aware of cases when
backpressure is overridden by a processor's implementation?  I think this
is important to understand, because backpressure is usually in place to
prevent a full disk, which is a fairly critical requirement.
- Is there something we can do to document this so it's more universally
understood?
- Perhaps the GetSFTP Max Selects property can indicate that it will
override backpressure?  In which case, are there other processors that
would need similar documentation?
- Or do we want a more universal approach, like putting this caveat in the
general documentation?

Joe


Re: GetSFTP backpressure question

Posted by Andrew Grande <ap...@gmail.com>.
UpdateAttribute is a good example of soft threshold behavior. By default,
this processor operates in micro batches of 100. So even if you set the
back pressure (BP) threshold to, e.g., 5, it will commit 100 events and
only then be throttled back by the connection. A user will see 100 events
in the connection even though 5 was the limit.

I'd say it's important to understand this behavior, and that the flow will
be enabled again as soon as the backlog drops below 5, but I'm not sure
whether there's a generic fix, or even whether a fix is due.
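
A toy sketch of the soft threshold described above, written in Python
rather than as NiFi code. It assumes the scheduler checks the threshold only
before triggering and that each trigger commits a whole micro batch. The
function name, the tick model, and the drain rate are assumptions made for
illustration; only the threshold of 5 and the batch of 100 come from the
example above.

```python
def simulate(threshold=5, batch_size=100, drain_per_tick=10, ticks=30):
    """Toy model of a soft back pressure threshold (not NiFi code).

    Returns (peak queue depth, list of ticks on which the producer ran).
    """
    depth = 0            # objects currently queued in the connection
    peak = 0             # highest depth observed
    trigger_ticks = []   # ticks on which the producer was allowed to run
    for tick in range(ticks):
        # The check happens once, before the producer is triggered.
        if depth < threshold:
            depth += batch_size          # the whole micro batch is committed
            trigger_ticks.append(tick)
        peak = max(peak, depth)
        # A hypothetical downstream consumer drains a fixed amount per tick.
        depth = max(0, depth - drain_per_tick)
    return peak, trigger_ticks


if __name__ == "__main__":
    print(simulate())
    print(simulate(drain_per_tick=7))
```

In this model the producer sits idle until the backlog falls below 5, and the
peak depth can land anywhere from 100 to 104 depending on how many objects
were still queued at the moment the check passed.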

Andrew

On Fri, Oct 28, 2016, 7:55 AM Joe Witt <jo...@gmail.com> wrote:

> Great questions and discussion points here and I agree with your
> statement about the importance of honoring back pressure targets the
> user believes they set.
>
> The way back pressure works is that before a processor is given a
> thread to execute (each onTrigger cycle), the framework checks all
> possible output relationships and ensures that, at that moment in
> time, all of them have space available according to the limits set on
> those connections (data size or object count).  Once that processor is
> given the thread to execute its onTrigger cycle, it is up to that
> processor to be a good steward.  The framework does offer a method for
> that processor to check whether all destinations have space available,
> which is important if, for efficiency reasons, it chooses to do more
> than one thing at a time.  The processor doesn't get to know how close
> to full the queues it writes to are, so that is important to
> understand as well.  To the processor, the destinations are either
> full or have space available.
>
> This sort of back pressure is an optimistic approach, which really
> means these are enforced as soft limits and, as you point out, can be
> exceeded in some cases.  It basically means that the back pressure
> target can be exceeded by however much data could be produced by a
> processor in a single execution cycle once it is given a thread.
>
> I believe the user's expectation is well articulated via the current
> mechanism of setting the max values on the connections and it is then
> important that processors get written or improved to better honor that
> or that they document for the user under what conditions they could
> exceed the backpressure target.
>
> Thanks
> Joe

Re: GetSFTP backpressure question

Posted by Joe Witt <jo...@gmail.com>.
Great questions and discussion points here and I agree with your
statement about the importance of honoring back pressure targets the
user believes they set.

The way back pressure works is that before a processor is given a
thread to execute (each onTrigger cycle), the framework checks all
possible output relationships and ensures that, at that moment in time,
all of them have space available according to the limits set on those
connections (data size or object count).  Once that processor is given
the thread to execute its onTrigger cycle, it is up to that processor
to be a good steward.  The framework does offer a method for that
processor to check whether all destinations have space available, which
is important if, for efficiency reasons, it chooses to do more than one
thing at a time.  The processor doesn't get to know how close to full
the queues it writes to are, so that is important to understand as
well.  To the processor, the destinations are either full or have space
available.
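
As a rough illustration of the "good steward" idea, here is a toy Python
model, not NiFi code. The Connection class, has_space(), and run_once() are
hypothetical names invented for this sketch; in NiFi's Java API the check
mentioned above is presumably ProcessContext.getAvailableRelationships(),
which this sketch does not call. File sizes are rounded to exactly 1 GB.

```python
GB = 1024 ** 3


class Connection:
    """Hypothetical stand-in for a connection with a data size threshold."""

    def __init__(self, threshold_bytes):
        self.threshold_bytes = threshold_bytes
        self.queued_bytes = 0

    def has_space(self):
        # The processor only learns "full" or "not full",
        # never how much room is left.
        return self.queued_bytes < self.threshold_bytes


def run_once(conn, file_sizes, max_selects, check_between_files):
    """One execution cycle of a hypothetical fetching processor.

    Returns the number of files emitted during this cycle.
    """
    emitted = 0
    for size in file_sizes[:max_selects]:
        if check_between_files and not conn.has_space():
            break  # good steward: stop and wait to be scheduled again
        conn.queued_bytes += size
        emitted += 1
    return emitted


if __name__ == "__main__":
    files = [GB] * 100
    for check in (False, True):
        conn = Connection(2 * GB)
        count = run_once(conn, files, 100, check_between_files=check)
        print(check, count, conn.queued_bytes // GB)
```

With a 2 GB threshold, the version that never checks queues all 100 files in
one cycle, while the version that checks between files stops after 2. Even
the checking version can overshoot by up to one file, because it only learns
that the queue is full after the file that filled it.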

This sort of back pressure is an optimistic approach, which really means
these are enforced as soft limits and, as you point out, can be exceeded
in some cases.  It basically means that the back pressure target can
be exceeded by however much data could be produced by a processor in a
single execution cycle once it is given a thread.
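
Put as a rough bound, with symbols introduced here only for illustration:
T is the configured threshold, B is the most data a processor can emit in
one execution cycle, and Q_max is the largest backlog the connection can
reach. The example plugs in the figures from the original question (2GB
threshold, up to 100 files per run), rounding each file to 1 GB.

```latex
Q_{\max} < T + B
\qquad\text{e.g.}\qquad
Q_{\max} < 2\,\mathrm{GB} + 100 \times 1\,\mathrm{GB} = 102\,\mathrm{GB}
```

The 71GB backlogs reported in the original message fall under this bound.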

I believe the user's expectation is well articulated via the current
mechanism of setting the max values on the connections and it is then
important that processors get written or improved to better honor that
or that they document for the user under what conditions they could
exceed the backpressure target.

Thanks
Joe
