Posted to users@nifi.apache.org by Scott Wagner <sw...@beenverified.com> on 2017/02/27 22:34:26 UTC

How to gracefully handle a circular graph?

Hello all,

     I have created a graph where I am downloading a number of rows from 
an SQL database, and each row defines a range of numbers (100-200, 
700-1500, etc.).  What I am then doing on the NiFi side is generating an 
individual FlowFile for each number in that range.  The way that I was 
accomplishing this was by setting a "current" attribute to the lower 
boundary and a second attribute to the upper boundary, and then 
creating two queues off the "success" output for a Processor (the 
ReplaceText processor in the bottom right of the image), one of which 
goes on to process that number's record (going off the bottom right in 
the picture), and the other one of which goes off to a processor to 
increment the "current" number, and will then forward it to the 
processor that will check to make sure that "current" is less than or 
equal to "upper boundary".

     This works great, until the queues end up filling up.  Once this 
happens, I have a gridlock situation where none of the processors in 
this triangle are running any longer, because they all have a full 
output queue.  I have tried searching the Internet and put a little 
thought into how I could make it so that my "Check if done" processor 
would prefer entries that are coming in from the circular portion of the 
graph, but so far haven't been able to come up with anything.  What I 
have tried is making both of the input queues to "Check if done" go 
through a funnel, and set an Oldest FlowFile prioritizer, but it still 
eventually ends up filling up the entire triangle of queues.
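
The gridlock described above can be reproduced in a toy model. This is only a sketch, assuming (as NiFi's back pressure does) that a processor is not scheduled while its output queue is at the threshold. The three-processor ring, `CAPACITY`, and `runnable` are made-up names for illustration, not anything taken from the actual flow:

```python
from collections import deque

CAPACITY = 3  # hypothetical back-pressure threshold; NiFi's default is far larger


def runnable(queues, capacity):
    """Return the indexes of processors that have input and room in their output queue.

    queues[i] is the input queue of processor i, and processor i writes
    to queues[(i + 1) % len(queues)], so the queues form a ring.
    """
    n = len(queues)
    return [
        i
        for i in range(n)
        if queues[i] and len(queues[(i + 1) % n]) < capacity
    ]


# Every queue in the ring is full: nobody has room to write, so nobody runs.
queues = [deque(range(CAPACITY)) for _ in range(3)]
print(runnable(queues, CAPACITY))  # []

# Freeing a single slot in queue 1 lets processor 0 (which writes to it) run again.
queues[1].popleft()
print(runnable(queues, CAPACITY))  # [0]
```

Once every queue in the cycle is full, nothing inside the cycle can free a slot on its own, which is why the prioritizer on the funnel does not help.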



     Does anyone have a suggestion as to how I could gracefully handle a 
situation like this?  I appreciate any advice.

Thanks!

- Scott Wagner


---
This email has been checked for viruses by AVG.
http://www.avg.com

Re: How to gracefully handle a circular graph?

Posted by Scott Wagner <sw...@beenverified.com>.

Bryan,

     This is a one-time batch job where I will be downloading a few 
million records, each of which has a range, and then pushing those 
through this process.  I hadn't thought of putting the ControlRate in 
front of the top-right corner.  I'm sure that would have worked, but 
it's just tough to tell at which rate they are getting exhausted from 
the loop, so it would probably take some experimentation to get it to 
the point where it would be able to run unsupervised.

     I had increased the back-pressure threshold to 100k flowfiles on 
each leg of the triangle, and it still ended up doing the same thing; it 
just took a lot longer to get into that position.
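
A back-of-the-envelope sketch of why a larger threshold only postpones the gridlock. It assumes the whole loop is treated as one pool of flowfiles with a single threshold, and that rows are admitted faster than they finish. The function name and all of the numbers are hypothetical:

```python
import math


def ticks_until_full(threshold, admitted_per_tick, finished_per_tick):
    """Return how many ticks pass before the in-flight count reaches the threshold.

    Returns None when the loop finishes rows at least as fast as it admits them,
    because then the in-flight count never grows.
    """
    net_growth = admitted_per_tick - finished_per_tick
    if net_growth <= 0:
        return None
    return math.ceil(threshold / net_growth)


print(ticks_until_full(10_000, 50, 10))   # 250
print(ticks_until_full(100_000, 50, 10))  # 2500
print(ticks_until_full(100_000, 10, 10))  # None
```

Under these assumptions a tenfold larger threshold buys ten times as long, but the loop still fills unless the admission rate drops to the rate at which rows finish.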

     I have implemented Matt's suggestion of using an ExecuteScript 
processor to explode out the number range into the content of the 
FlowFile and then a SplitText to get it to turn into the arbitrary 
number of FlowFiles, and that is working for me now.

Thanks for your time and suggestion!

- Scott


Re: How to gracefully handle a circular graph?

Posted by Bryan Bende <bb...@gmail.com>.

Scott,

Do you have a constant flow of data from your database, or is this more
like a large batch that comes in and processes in NiFi for a while, and then
some time later you pull another batch?

If it is more like the batch scenario, you might be able to stick a
ControlRate processor before your "check if done" processor to throttle the
flow files entering the loop. This obviously doesn't work well if you have
a constant flow of new data entering the loop because it will just make
everything before the loop back up as well, but it might be reasonable
while working on a given batch.
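
As a rough aid for picking a ControlRate setting, here is a sketch under the assumption that each row makes about one pass around the loop per number in its range, so the loop can absorb new rows at roughly its pass throughput divided by the average range length. `sustainable_rows_per_second` and the throughput figure are hypothetical; the real throughput would have to be measured on the running flow:

```python
def sustainable_rows_per_second(loop_passes_per_second, ranges):
    """Estimate how many new rows per second the loop can absorb without growing.

    ranges is a list of (lower, upper) pairs with inclusive bounds, so a row
    needs upper - lower + 1 passes before it leaves the loop.
    """
    passes_per_row = [upper - lower + 1 for lower, upper in ranges]
    mean_passes = sum(passes_per_row) / len(passes_per_row)
    return loop_passes_per_second / mean_passes


# The two example ranges from the original question need 101 and 801 passes.
ranges = [(100, 200), (700, 1500)]
print(sustainable_rows_per_second(4510, ranges))  # 10.0
```

Admitting rows faster than this estimate would let the loop population grow until the queues fill.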

You can also increase the back-pressure threshold on all of the queues if
you have enough memory allocated to the NiFi JVM. Right-click on each queue
and choose configure. I believe they default to 10k flow files or 1GB, and
based on the screenshot they are hitting the 10k threshold, so you could
bump this up a bit to give more breathing room.

-Bryan



Re: How to gracefully handle a circular graph?

Posted by Scott Wagner <sw...@beenverified.com>.

Matt,

     Thanks for the suggestion.  I did end up creating an ExecuteScript 
processor that expanded out the range of the numbers into the content, 
and it is working for this scenario.

Thanks again!

- Scott


Re: How to gracefully handle a circular graph?

Posted by Matt Foley <ma...@apache.org>.

If I understand correctly, your goal is that for each input row that specifies a range, A to A+N, you generate a sequence of N (or perhaps N+1) flowfiles, right?  And the only difference between the flowfiles is that you’ve Replaced the range specification with a single number from that range?

 

I would suggest that at the level of the row input, you use ExecuteScript to expand each input row into N rows, with the substituted number values, then run that through SplitText, to get one row per flowfile.  This should be way more efficient, as well as much safer than a cyclic graph.
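
The idea can be sketched in plain Python. This is not the script used in this thread (it is never shown) and it does not use the ExecuteScript API; it only illustrates the two steps. It assumes the bounds are inclusive, which matches the "less than or equal to" check in the original flow. `expand_range` and `split_lines` are hypothetical helper names:

```python
def expand_range(lower, upper):
    """Return text with one number per line for lower..upper, bounds inclusive.

    This stands in for the expansion step: one input row becomes
    multi-line content in a single flowfile.
    """
    return "\n".join(str(n) for n in range(lower, upper + 1))


def split_lines(content):
    """Return one string per non-empty line, standing in for the per-line split."""
    return [line for line in content.splitlines() if line]


content = expand_range(100, 103)
print(split_lines(content))                          # ['100', '101', '102', '103']
print(len(split_lines(expand_range(700, 1500))))     # 801
```

With inclusive bounds a range A to A+N yields N+1 lines, so 100-200 gives 101 flowfiles. Because the expansion happens in one step, no flowfile has to travel around a cycle.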

 

Cheers,

--Matt

 
