You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nifi.apache.org by Paul Bormans <pe...@gmail.com> on 2016/03/30 17:11:44 UTC

How to only take new rows using ExecuteSQL processor?

I'm evaluating Apache Nifi as data ingestion tool to load data from an
RDBMS into S3. A first test shows odd behavior where the same rows are
written to the flowfile over and over again while i expected that only new
rows are written.

In fact i was missing configuration options to specify what column could be
used to query only for new rows.

Taking a look at the processor implementation makes me believe that the
only option is to define a query including OFFSET n LIMIT m where "n" is
dynamically set based upon previous onTriggers; would this even be possible?

Some setup info:
nifi: 0.6.0
backend: postgresql
driver: postgresql-9.4.1208.jre6.jar
query: select * from addresses

More in general i don't see a use-case where the current ExecuteSQL
processor fits as a processor (without input flowfile). Someone can explain?

Paul

Re: How to only take new rows using ExecuteSQL processor?

Posted by Simon Ball <sb...@hortonworks.com>.
Hi Paul,

In the scenario where you need complex joins and incremental loads, the best bet is probably to create a view in your database with the query required. The QueryDatabaseTable can operate against this view as long as the view has a suitable ‘id’ column in it. 

That would provide a work around for the lack of a custom query at the moment. 

I also not on your ticket an excellent point about limiting the return batch size. You can achieve something like this with the query max time setting, but it would certainly be a good addition (and maybe deserves its own ticket).

Simon

> On 31 Mar 2016, at 10:15, Paul Bormans <pe...@gmail.com> wrote:
> 
> Hi Jou,
> 
> Thank you for the tip: great!!!
> 
> I guess nifi is too new still because i did some extensive searching on
> this subject and QueryDatabaseTable was not mentioned.....
> 
> This processor does exactly what i expect/need!
> 
> One shortcoming... maybe i should enter a ticket for this. Usually
> extraction of data from rdbms involves complex queries with joins and these
> are not supported as far as i can see. We could also extend the processor
> so that a configuration option is to specify the full query which i believe
> is much more flexible than enumerating columns from a specific table.
> 
> Paul
> 
> 
> 
> On Wed, Mar 30, 2016 at 5:19 PM, Joe Witt <jo...@gmail.com> wrote:
> 
>> Paul,
>> 
>> In Apache NiFi 0.6.0 if you're looking for a change capture type
>> mechanism to source from relational databases take a look at
>> QueryDatabaseTable [1].
>> 
>> That processor is new and any feedback and or contribs for it would be
>> awesome.
>> 
>> ExecuteSQL does have some time driven use cases to capture snapshots
>> and such but you're right that it doesn't sound like a good fit for
>> your case.
>> 
>> [1]
>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html
>> 
>> On Wed, Mar 30, 2016 at 9:11 AM, Paul Bormans <pe...@gmail.com> wrote:
>>> I'm evaluating Apache Nifi as data ingestion tool to load data from an
>>> RDBMS into S3. A first test shows odd behavior where the same rows are
>>> written to the flowfile over and over again while i expected that only
>> new
>>> rows are written.
>>> 
>>> In fact i was missing configuration options to specify what column could
>> be
>>> used to query only for new rows.
>>> 
>>> Taking a look at the processor implementation makes me believe that the
>>> only option is to define a query including OFFSET n LIMIT m where "n" is
>>> dynamically set based upon previous onTriggers; would this even be
>> possible?
>>> 
>>> Some setup info:
>>> nifi: 0.6.0
>>> backend: postgresql
>>> driver: postgresql-9.4.1208.jre6.jar
>>> query: select * from addresses
>>> 
>>> More in general i don't see a use-case where the current ExecuteSQL
>>> processor fits as a processor (without input flowfile). Someone can
>> explain?
>>> 
>>> Paul
>> 


Re: How to only take new rows using ExecuteSQL processor?

Posted by Paul Bormans <pe...@gmail.com>.
Issue submitted: https://issues.apache.org/jira/browse/NIFI-1706

On Thu, Mar 31, 2016 at 11:15 AM, Paul Bormans <pe...@gmail.com> wrote:

> Hi Jou,
>
> Thank you for the tip: great!!!
>
> I guess nifi is too new still because i did some extensive searching on
> this subject and QueryDatabaseTable was not mentioned.....
>
> This processor does exactly what i expect/need!
>
> One shortcoming... maybe i should enter a ticket for this. Usually
> extraction of data from rdbms involves complex queries with joins and these
> are not supported as far as i can see. We could also extend the processor
> so that a configuration option is to specify the full query which i believe
> is much more flexible than enumerating columns from a specific table.
>
> Paul
>
>
>
> On Wed, Mar 30, 2016 at 5:19 PM, Joe Witt <jo...@gmail.com> wrote:
>
>> Paul,
>>
>> In Apache NiFi 0.6.0 if you're looking for a change capture type
>> mechanism to source from relational databases take a look at
>> QueryDatabaseTable [1].
>>
>> That processor is new and any feedback and or contribs for it would be
>> awesome.
>>
>> ExecuteSQL does have some time driven use cases to capture snapshots
>> and such but you're right that it doesn't sound like a good fit for
>> your case.
>>
>> [1]
>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html
>>
>> On Wed, Mar 30, 2016 at 9:11 AM, Paul Bormans <pe...@gmail.com> wrote:
>> > I'm evaluating Apache Nifi as data ingestion tool to load data from an
>> > RDBMS into S3. A first test shows odd behavior where the same rows are
>> > written to the flowfile over and over again while i expected that only
>> new
>> > rows are written.
>> >
>> > In fact i was missing configuration options to specify what column
>> could be
>> > used to query only for new rows.
>> >
>> > Taking a look at the processor implementation makes me believe that the
>> > only option is to define a query including OFFSET n LIMIT m where "n" is
>> > dynamically set based upon previous onTriggers; would this even be
>> possible?
>> >
>> > Some setup info:
>> > nifi: 0.6.0
>> > backend: postgresql
>> > driver: postgresql-9.4.1208.jre6.jar
>> > query: select * from addresses
>> >
>> > More in general i don't see a use-case where the current ExecuteSQL
>> > processor fits as a processor (without input flowfile). Someone can
>> explain?
>> >
>> > Paul
>>
>
>

Re: How to only take new rows using ExecuteSQL processor?

Posted by Paul Bormans <pe...@gmail.com>.
Hi Jou,

Thank you for the tip: great!!!

I guess nifi is too new still because i did some extensive searching on
this subject and QueryDatabaseTable was not mentioned.....

This processor does exactly what i expect/need!

One shortcoming... maybe i should enter a ticket for this. Usually
extraction of data from rdbms involves complex queries with joins and these
are not supported as far as i can see. We could also extend the processor
so that a configuration option is to specify the full query which i believe
is much more flexible than enumerating columns from a specific table.

Paul



On Wed, Mar 30, 2016 at 5:19 PM, Joe Witt <jo...@gmail.com> wrote:

> Paul,
>
> In Apache NiFi 0.6.0 if you're looking for a change capture type
> mechanism to source from relational databases take a look at
> QueryDatabaseTable [1].
>
> That processor is new and any feedback and or contribs for it would be
> awesome.
>
> ExecuteSQL does have some time driven use cases to capture snapshots
> and such but you're right that it doesn't sound like a good fit for
> your case.
>
> [1]
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html
>
> On Wed, Mar 30, 2016 at 9:11 AM, Paul Bormans <pe...@gmail.com> wrote:
> > I'm evaluating Apache Nifi as data ingestion tool to load data from an
> > RDBMS into S3. A first test shows odd behavior where the same rows are
> > written to the flowfile over and over again while i expected that only
> new
> > rows are written.
> >
> > In fact i was missing configuration options to specify what column could
> be
> > used to query only for new rows.
> >
> > Taking a look at the processor implementation makes me believe that the
> > only option is to define a query including OFFSET n LIMIT m where "n" is
> > dynamically set based upon previous onTriggers; would this even be
> possible?
> >
> > Some setup info:
> > nifi: 0.6.0
> > backend: postgresql
> > driver: postgresql-9.4.1208.jre6.jar
> > query: select * from addresses
> >
> > More in general i don't see a use-case where the current ExecuteSQL
> > processor fits as a processor (without input flowfile). Someone can
> explain?
> >
> > Paul
>

Re: How to only take new rows using ExecuteSQL processor?

Posted by Joe Witt <jo...@gmail.com>.
Paul,

In Apache NiFi 0.6.0 if you're looking for a change capture type
mechanism to source from relational databases take a look at
QueryDatabaseTable [1].

That processor is new and any feedback and or contribs for it would be awesome.

ExecuteSQL does have some time driven use cases to capture snapshots
and such but you're right that it doesn't sound like a good fit for
your case.

[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html

On Wed, Mar 30, 2016 at 9:11 AM, Paul Bormans <pe...@gmail.com> wrote:
> I'm evaluating Apache Nifi as data ingestion tool to load data from an
> RDBMS into S3. A first test shows odd behavior where the same rows are
> written to the flowfile over and over again while i expected that only new
> rows are written.
>
> In fact i was missing configuration options to specify what column could be
> used to query only for new rows.
>
> Taking a look at the processor implementation makes me believe that the
> only option is to define a query including OFFSET n LIMIT m where "n" is
> dynamically set based upon previous onTriggers; would this even be possible?
>
> Some setup info:
> nifi: 0.6.0
> backend: postgresql
> driver: postgresql-9.4.1208.jre6.jar
> query: select * from addresses
>
> More in general i don't see a use-case where the current ExecuteSQL
> processor fits as a processor (without input flowfile). Someone can explain?
>
> Paul