Posted to dev@nifi.apache.org by "Irizarry Jr., Nazario" <na...@mitre.org> on 2017/02/06 15:00:19 UTC

Re: [DISCUSS] Run Once scheduling

This was submitted as NIFI-3422 and PR 1458.

Thanks,

Naz Irizarry
MITRE Corp.
617-893-0074



> On Jan 31, 2017, at 10:28 AM, Joe Witt <jo...@gmail.com> wrote:
> 
> Hello
> 
> You will first want to create a JIRA describing the work/idea being
> done.  Then in the commit log be sure to reference NIFI-XXXX.
> 
> Take a look here for a helpful guide on how best to help the community
> land contributions.
> 
> https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide
> 
> Thanks
> Joe
> 
> On Tue, Jan 31, 2017 at 10:17 AM, Irizarry Jr., Nazario <na...@mitre.org> wrote:
>> I am about to submit a PR for an implementation of the run-once scheduling.  There is no outstanding JIRA ticket on this so what kind of NIFI-XXXX or other labeling should I put into the title of the PR?
>> 
>> Thanks,
>> 
>> Naz Irizarry
>> MITRE Corp.
>> 617-893-0074
>> 
>> 
>> 
>>> On Jan 12, 2017, at 3:55 PM, Irizarry Jr., Nazario <na...@mitre.org> wrote:
>>> 
>>> I think it is a matter of the model in one's head.  If one thinks of a continuous activation paradigm, the green arrow versus red square indicates what you point out.  On the other hand, in an ad-hoc run-once paradigm the green arrow is a nice, succinct indicator of what has not run yet.  In an analytics environment, processing can take minutes to hours for some processors.  As processing goes on, the processors with the remaining green arrows indicate what is left to complete in the “visual script.”
>>> 
>>> Consider the following example.  Say there are five processors.  The first processor, say A, makes a query and gets data.  Depending on what I know about today’s input to A, the output should be directed to B1, B2, B3, or B4.  The B's are actually variations on a particular analytic algorithm, and most of the time only one of them needs to be used.  On one day (based on external knowledge) I click on A and B1 and then the Start arrow.  On another day I modify the query, click on A and B2, and then click on the Start arrow, and so on.  Clearly I could have four flows and I could start/stop entire flows.  But as the number of processing stages increases and the number of processing alternatives increases at each stage, the combinatorial growth makes distinct flows painful to manage.  Sometimes it is easier to have one all-encompassing flow and then allow the analyst to shift-click the portions they want to invoke for the next “run.”
>>> 
>>> 
>>> Naz Irizarry
>>> MITRE Corp.
>>> 617-893-0074
>>> 
>>> 
>>> 
>>>> On Jan 12, 2017, at 2:14 PM, Joe Witt <jo...@gmail.com> wrote:
>>>> 
>>>> Naz
>>>> 
>>>> The green arrow vs red square says "scheduled to execute" vs "not
>>>> scheduled to execute".  For most processors, such as those which take
>>>> input flow files from a connection, even if they're scheduled to run
>>>> they're not going to be executed unless there is work to do (data
>>>> sitting in the queue) and space available (on all destination
>>>> relationships).  Because of this I'm suggesting to consider just
>>>> leaving them all scheduled to execute even though they won't actually
>>>> be doing anything most of the time.  The stats on each component tell
>>>> you how many times it was actually invoked and how much data it
>>>> processed, etc.  So you'll see that they're not doing anything most
>>>> of the time.
>>>> 
>>>> You mentioned not wanting to have to do anything manual, yet run once
>>>> would be a manual construct, right?
>>>> 
>>>> I don't mean to suggest I'm closed off to the idea of a run once
>>>> concept; I just really want to understand your use case better.
>>>> 
>>>> Thanks
>>>> Joe
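
The distinction drawn in the oldest message above, between "scheduled to execute" and actually being executed, can be illustrated with a small model. This is only a sketch under stated assumptions: the `Connection` and `Processor` classes, their fields, and the `tick` method are hypothetical names invented for illustration and are not NiFi's API. It shows a scheduled processor doing nothing while its input queue is empty or a destination is full.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Hypothetical bounded queue between two processors (not a NiFi class)."""
    capacity: int
    queue: deque = field(default_factory=deque)

    def has_data(self) -> bool:
        return len(self.queue) > 0

    def has_space(self) -> bool:
        return len(self.queue) < self.capacity


@dataclass
class Processor:
    """Hypothetical processor with one input and several destinations."""
    name: str
    scheduled: bool
    source: Connection
    destinations: list
    invocations: int = 0

    def should_run(self) -> bool:
        # Being scheduled (green arrow) is necessary but not sufficient:
        # there must be queued work and space on every destination.
        return (
            self.scheduled
            and self.source.has_data()
            and all(d.has_space() for d in self.destinations)
        )

    def tick(self) -> bool:
        """One scheduler pass; returns True if the processor actually ran."""
        if not self.should_run():
            return False
        item = self.source.queue.popleft()
        for d in self.destinations:
            d.queue.append(item)
        self.invocations += 1
        return True


inbound = Connection(capacity=10)
outbound = Connection(capacity=1)
b1 = Processor("B1", scheduled=True, source=inbound, destinations=[outbound])

ran_idle = b1.tick()             # scheduled, but nothing is queued
inbound.queue.extend(["f1", "f2"])
ran_first = b1.tick()            # work is queued and outbound has space
ran_blocked = b1.tick()          # outbound is now full, so B1 stays idle
```

In this model the invocation counter plays the role of the per-component stats mentioned in the thread: a processor can stay scheduled indefinitely while its counter shows it was invoked only once.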