You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@nifi.apache.org by Daniel Einspanjer <da...@designet.com> on 2016/10/30 23:07:23 UTC

ExecuteSQL just once

If I put a SQL statement into an ExecuteSQL processor along the lines of:

SELECT * FROM foo WHERE thedate >= '2016-01-01'

and I build a flow that does something with that data, I'f found if I leave
the ExecuteSQL processor running, it will continually run the query,
resulting in lots of duplicate data.

What is the proper way to handle something that would be sort of a one time
event?

Right now, I'm just setting a long run schedule time and making sure I turn
the step off after it finishes the first run.

-Daniel

Re: ExecuteSQL just once

Posted by Matt Burgess <ma...@apache.org>.
Daniel,

One thing you can do (albeit manually) for a "run once" is to schedule
ExecuteSQL for some large interval (3600 sec for example), then start
and immediately stop the processor. It will run once (stopping the
processor doesn't interrupt the current execution, it just prevents it
from being scheduled again). The Jira Bryan mentioned is [1], if
implemented you could schedule a processor to run N times, and it
won't be scheduled after that number of executions has occurred.

Back to the custom query, can you provide an example of such a query
that you'd like to use? What if we were to add the ability to
QueryDatabaseTable and GenerateTableFetch where you can add WHERE
clauses, GROUP BYs, ORDER BYs, etc.?  We avoided the ability for the
user to provide a custom query, because we'd have to parse the
arbitrary SQL, insert our own clauses, and offer a way for the user to
access state properties (possibly with Expression Language but they'd
be "transient attributes", only available to that processor). It gets
a little complex, but there is precedence with UpdateAttribute [2].

There are a couple of workarounds. For example you could do the
inverse of trying to provide a custom query for the DB processor(s),
where you let GenerateTableFetch generate SQL queries, then use
ReplaceText to update them inline with your clauses. Alternatively,
the Column Names property is used verbatim in the SELECT clause, so
you could put a more complex clause in there (maybe a nested select
ending with AS so the generated SQL makes it look like a table alias
or something).

It's definitely a place where we can improve the user experience
and/or feature set, I'm interested to hear your thoughts on this and
anything else :)

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-1407
[2] https://issues.apache.org/jira/browse/NIFI-1582

On Sun, Oct 30, 2016 at 8:03 PM, Daniel Einspanjer
<da...@designet.com> wrote:
> Unfortunately, neither of those allow you to specify a custom query, so if
> you are doing anything other than a select *, you have a problem.
>
> The use case I was working with was a "getting started" kind of thing.
> Slurp some data in from a DB and try to work with it in various flows.
> Unfortunately, my laptop got rather unhappy when I accidentally left the
> schedule for ExecuteSQL set to 0 seconds and ran a large query.  The step
> spewed out several repetitions of that query which bogged things down and
> prevented me from being able to stop other processors and clear queues
> because the downstream processors had already started processing the first
> of the large flow files.
>
> I'd say probably something that would make a convenient "run once unless
> this flag is cleared" type setting would have been what I was looking for.
> I admit it isn't quite as likely to be needed in a normal production
> environment though, just development.
>
> -Daniel
>
> On Sun, Oct 30, 2016 at 7:55 PM Matt Burgess <ma...@gmail.com> wrote:
>>
>> We have QueryDatabaseTable and GenerateTableFetch for such things,
>> especially the former for 1.0.0. QueryDatabaseTable allows you to pick
>> "thedate" as a max-value column that it keeps track of, and you can specify
>> the initial maximum value (not sure what version that was added in as I'm
>> AFK).
>>
>> Sent from my iPhone
>>
>> On Oct 30, 2016, at 7:11 PM, Joe Witt <jo...@gmail.com> wrote:
>>
>> Daniel
>>
>> We don't have mechanisms provided for one time execution at this point.
>> Everything is oriented toward continuous execution.  That said this is
>> starting to come up more and more so seems like we should offer run once
>> scheduling or something like that.
>>
>> Would that give you what you are looking for?  Can you talk about your use
>> case a bit more ?
>>
>> Thanks
>> Joe
>>
>> On Oct 30, 2016 7:07 PM, "Daniel Einspanjer"
>> <da...@designet.com> wrote:
>> >
>> > If I put a SQL statement into an ExecuteSQL processor along the lines
>> > of:
>> >
>> > SELECT * FROM foo WHERE thedate >= '2016-01-01'
>> >
>> > and I build a flow that does something with that data, I'f found if I
>> > leave the ExecuteSQL processor running, it will continually run the query,
>> > resulting in lots of duplicate data.
>> >
>> > What is the proper way to handle something that would be sort of a one
>> > time event?
>> >
>> > Right now, I'm just setting a long run schedule time and making sure I
>> > turn the step off after it finishes the first run.
>> >
>> > -Daniel

Re: ExecuteSQL just once

Posted by Bryan Bende <bb...@gmail.com>.
There is also a JIRA/PR out there to let processors specify default values
for things like Run Schedule. I would imagine if we had that, and we went
through all of the source processors and provided reasonable defaults, that
might have helped here. It still doesn't give you the "run once"
capability, but at least you wouldn't have had a ton of repeated executions
of the expensive query.

On Sunday, October 30, 2016, Daniel Einspanjer <
daniel.einspanjer@designet.com> wrote:

> Unfortunately, neither of those allow you to specify a custom query, so if
> you are doing anything other than a select *, you have a problem.
>
> The use case I was working with was a "getting started" kind of thing.
> Slurp some data in from a DB and try to work with it in various flows.
> Unfortunately, my laptop got rather unhappy when I accidentally left the
> schedule for ExecuteSQL set to 0 seconds and ran a large query.  The step
> spewed out several repetitions of that query which bogged things down and
> prevented me from being able to stop other processors and clear queues
> because the downstream processors had already started processing the first
> of the large flow files.
>
> I'd say probably something that would make a convenient "run once unless
> this flag is cleared" type setting would have been what I was looking for.
> I admit it isn't quite as likely to be needed in a normal production
> environment though, just development.
>
> -Daniel
>
> On Sun, Oct 30, 2016 at 7:55 PM Matt Burgess <mattyb149@gmail.com
> <javascript:_e(%7B%7D,'cvml','mattyb149@gmail.com');>> wrote:
>
>> We have QueryDatabaseTable and GenerateTableFetch for such things,
>> especially the former for 1.0.0. QueryDatabaseTable allows you to pick
>> "thedate" as a max-value column that it keeps track of, and you can specify
>> the initial maximum value (not sure what version that was added in as I'm
>> AFK).
>>
>> Sent from my iPhone
>>
>> On Oct 30, 2016, at 7:11 PM, Joe Witt <joe.witt@gmail.com
>> <javascript:_e(%7B%7D,'cvml','joe.witt@gmail.com');>> wrote:
>>
>> Daniel
>>
>> We don't have mechanisms provided for one time execution at this point.
>> Everything is oriented toward continuous execution.  That said this is
>> starting to come up more and more so seems like we should offer run once
>> scheduling or something like that.
>>
>> Would that give you what you are looking for?  Can you talk about your
>> use case a bit more ?
>>
>> Thanks
>> Joe
>>
>> On Oct 30, 2016 7:07 PM, "Daniel Einspanjer" <daniel.einspanjer@designet.
>> com <javascript:_e(%7B%7D,'cvml','daniel.einspanjer@designet.com');>>
>> wrote:
>> >
>> > If I put a SQL statement into an ExecuteSQL processor along the lines
>> of:
>> >
>> > SELECT * FROM foo WHERE thedate >= '2016-01-01'
>> >
>> > and I build a flow that does something with that data, I'f found if I
>> leave the ExecuteSQL processor running, it will continually run the query,
>> resulting in lots of duplicate data.
>> >
>> > What is the proper way to handle something that would be sort of a one
>> time event?
>> >
>> > Right now, I'm just setting a long run schedule time and making sure I
>> turn the step off after it finishes the first run.
>> >
>> > -Daniel
>>
>>

-- 
Sent from Gmail Mobile

Re: ExecuteSQL just once

Posted by Daniel Einspanjer <da...@designet.com>.
Unfortunately, neither of those allow you to specify a custom query, so if
you are doing anything other than a select *, you have a problem.

The use case I was working with was a "getting started" kind of thing.
Slurp some data in from a DB and try to work with it in various flows.
Unfortunately, my laptop got rather unhappy when I accidentally left the
schedule for ExecuteSQL set to 0 seconds and ran a large query.  The step
spewed out several repetitions of that query which bogged things down and
prevented me from being able to stop other processors and clear queues
because the downstream processors had already started processing the first
of the large flow files.

I'd say probably something that would make a convenient "run once unless
this flag is cleared" type setting would have been what I was looking for.
I admit it isn't quite as likely to be needed in a normal production
environment though, just development.

-Daniel

On Sun, Oct 30, 2016 at 7:55 PM Matt Burgess <ma...@gmail.com> wrote:

> We have QueryDatabaseTable and GenerateTableFetch for such things,
> especially the former for 1.0.0. QueryDatabaseTable allows you to pick
> "thedate" as a max-value column that it keeps track of, and you can specify
> the initial maximum value (not sure what version that was added in as I'm
> AFK).
>
> Sent from my iPhone
>
> On Oct 30, 2016, at 7:11 PM, Joe Witt <jo...@gmail.com> wrote:
>
> Daniel
>
> We don't have mechanisms provided for one time execution at this point.
> Everything is oriented toward continuous execution.  That said this is
> starting to come up more and more so seems like we should offer run once
> scheduling or something like that.
>
> Would that give you what you are looking for?  Can you talk about your use
> case a bit more ?
>
> Thanks
> Joe
>
> On Oct 30, 2016 7:07 PM, "Daniel Einspanjer" <
> daniel.einspanjer@designet.com> wrote:
> >
> > If I put a SQL statement into an ExecuteSQL processor along the lines of:
> >
> > SELECT * FROM foo WHERE thedate >= '2016-01-01'
> >
> > and I build a flow that does something with that data, I'f found if I
> leave the ExecuteSQL processor running, it will continually run the query,
> resulting in lots of duplicate data.
> >
> > What is the proper way to handle something that would be sort of a one
> time event?
> >
> > Right now, I'm just setting a long run schedule time and making sure I
> turn the step off after it finishes the first run.
> >
> > -Daniel
>
>

Re: ExecuteSQL just once

Posted by Matt Burgess <ma...@gmail.com>.
We have QueryDatabaseTable and GenerateTableFetch for such things, especially the former for 1.0.0. QueryDatabaseTable allows you to pick "thedate" as a max-value column that it keeps track of, and you can specify the initial maximum value (not sure what version that was added in as I'm AFK).

Sent from my iPhone

> On Oct 30, 2016, at 7:11 PM, Joe Witt <jo...@gmail.com> wrote:
> 
> Daniel
> 
> We don't have mechanisms provided for one time execution at this point.  Everything is oriented toward continuous execution.  That said this is starting to come up more and more so seems like we should offer run once scheduling or something like that.  
> 
> Would that give you what you are looking for?  Can you talk about your use case a bit more ?
> 
> Thanks
> Joe
> 
> On Oct 30, 2016 7:07 PM, "Daniel Einspanjer" <da...@designet.com> wrote:
> >
> > If I put a SQL statement into an ExecuteSQL processor along the lines of:
> >
> > SELECT * FROM foo WHERE thedate >= '2016-01-01'
> >
> > and I build a flow that does something with that data, I'f found if I leave the ExecuteSQL processor running, it will continually run the query, resulting in lots of duplicate data.
> >
> > What is the proper way to handle something that would be sort of a one time event?
> >
> > Right now, I'm just setting a long run schedule time and making sure I turn the step off after it finishes the first run.
> >
> > -Daniel

Re: ExecuteSQL just once

Posted by Joe Witt <jo...@gmail.com>.
Daniel

We don't have mechanisms provided for one time execution at this point.
Everything is oriented toward continuous execution.  That said this is
starting to come up more and more so seems like we should offer run once
scheduling or something like that.

Would that give you what you are looking for?  Can you talk about your use
case a bit more ?

Thanks
Joe

On Oct 30, 2016 7:07 PM, "Daniel Einspanjer" <da...@designet.com>
wrote:
>
> If I put a SQL statement into an ExecuteSQL processor along the lines of:
>
> SELECT * FROM foo WHERE thedate >= '2016-01-01'
>
> and I build a flow that does something with that data, I'f found if I
leave the ExecuteSQL processor running, it will continually run the query,
resulting in lots of duplicate data.
>
> What is the proper way to handle something that would be sort of a one
time event?
>
> Right now, I'm just setting a long run schedule time and making sure I
turn the step off after it finishes the first run.
>
> -Daniel