Posted to users@nifi.apache.org by "Peter Wicks (pwicks)" <pw...@micron.com> on 2016/08/26 13:45:25 UTC

Kill-and-Fill Pattern?

I have a source SQL table that I'm reading with a SQL select statement.  I want to kill and fill a destination SQL table with this source data on an interval.

My current, fill-only pattern is: ExecuteSQL -> ConvertAvroToJSON -> ConvertJSONToSQL -> PutSQL.

I'm trying to come up with a good way to delete the existing data before loading the new data.
One option I've considered is to mark the original Avro file with a UUID attribute and store that UUID as a field in the destination table; then split off a branch that uses ReplaceText to build a DELETE statement removing all rows whose UUID doesn't match the current batch.  I think this could work, but I'm worried about timing the SQL DELETE.  I really want the kill and the fill steps to happen in a single transaction.
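For illustration, here is a rough sketch of the single-transaction behavior I'm after, written as plain Python DB-API code rather than a NiFi flow (the driver, connection string, table name, batch_uuid column, and sample rows are all hypothetical):

    import psycopg2  # hypothetical driver; any DB-API 2.0 connection behaves the same way

    BATCH_UUID = "0b1c2d3e"              # the UUID stamped on the original Avro FlowFile
    new_rows = [(1, "a"), (2, "b")]      # stand-in for the converted JSON records

    conn = psycopg2.connect("dbname=dest user=etl")  # hypothetical connection string
    try:
        with conn:  # one transaction: commit on success, roll back on error
            with conn.cursor() as cur:
                # fill: insert the new batch, tagged with its UUID
                cur.executemany(
                    "INSERT INTO dest_table (id, val, batch_uuid) VALUES (%s, %s, %s)",
                    [(i, v, BATCH_UUID) for i, v in new_rows],
                )
                # kill: remove every row that is not part of this batch
                cur.execute("DELETE FROM dest_table WHERE batch_uuid <> %s", (BATCH_UUID,))
    finally:
        conn.close()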

The other issue is what happens if PutSQL is down for a while due to database downtime and several kill-and-fill batches pile up.  Is there a way I can use backpressure to make sure only a single file gets converted from JSON to SQL at a time, in order to avoid mixing batches?
I also considered FlowFile expiration, but is there a way I can tell NiFi to only expire a FlowFile when a newer FlowFile has entered the queue? For example: with one FlowFile in the queue, no expiration occurs; as soon as a second (newer) FlowFile enters the queue, the first one expires.

Thanks,
  Peter

RE: Kill-and-Fill Pattern?

Posted by "Peter Wicks (pwicks)" <pw...@micron.com>.
Toivo,

I started down this path, but then came up with a broader solution (which I have not tested):


1. Do a normal ConvertJSONToSQL.

2. Use MergeContent to group all of the FlowFiles from the same batch into a single new FlowFile, using the FlowFile Stream merge format.

3. Update PutSQL to support merged FlowFiles.

--Peter



Re: Kill-and-Fill Pattern?

Posted by Toivo Adams <to...@gmail.com>.
hi

Could a new processor, PutAvroSQL, help?
It would read the data in Avro format and insert all records at once.
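As a rough sketch of the idea outside NiFi (fastavro, the driver, and the table layout are all assumptions), such a processor would essentially do:

    import fastavro
    import psycopg2  # hypothetical driver

    # read every record out of the Avro content (what would be the FlowFile payload)
    with open("batch.avro", "rb") as f:
        records = list(fastavro.reader(f))

    conn = psycopg2.connect("dbname=dest user=etl")  # hypothetical connection string
    try:
        with conn:  # one transaction for the whole batch
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO dest_table (id, val) VALUES (%(id)s, %(val)s)",
                    records,  # assumes the Avro schema has 'id' and 'val' fields
                )
    finally:
        conn.close()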

thanks
toivo


Re: Kill-and-Fill Pattern?

Posted by Lee Laim <le...@gmail.com>.
Hi Peter,

I'm making a few assumptions about batch sizes in the flow, mainly that a single
FlowFile (representing a batch) goes into ConvertJSONToSQL and many FlowFiles
come out.

An inline ExecuteStreamCommand processor placed before the PutSQL processor may
work.  It would run a script that deletes the records from the old table and
then passes the original FlowFile content through to PutSQL.  That is almost,
but not quite, a single transaction, so an additional validation flow could
help; I could see the potential for database downtime occurring mid-batch.
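Very roughly, the ExecuteStreamCommand script could look something like this
(Python; the driver, connection string, and table name are placeholders, and
the FlowFile content just passes through on stdout):

    #!/usr/bin/env python3
    # Kill step for ExecuteStreamCommand: clear the destination table, then pass
    # the incoming FlowFile content through unchanged so PutSQL can do the fill.
    import sys

    import psycopg2  # hypothetical driver

    conn = psycopg2.connect("dbname=dest user=etl")  # hypothetical connection string
    try:
        with conn:  # commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute("DELETE FROM dest_table")
    finally:
        conn.close()

    # ExecuteStreamCommand feeds the FlowFile content on stdin and, by default,
    # replaces the FlowFile content with whatever the command writes to stdout.
    sys.stdout.buffer.write(sys.stdin.buffer.read())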

For the second issue (preventing batches from mixing), you could set the back
pressure object threshold to 1 on the connection into PutSQL.  That way back
pressure is engaged except when the queue is empty.  Additionally, a one-minute
run schedule on the penultimate processor would further ensure you pull one
FlowFile at a time.

The expire-the-older-FlowFile idea is interesting.
A PutFile/GetFile combo, where new FlowFiles overwrite older FlowFiles with the
same filename on disk, would accomplish this.  I don't think there is currently
an in-memory version of this, but...

putting on my Rube Goldberg hat, I'm envisioning two UpdateAttribute
processors and one RouteOnAttribute processor in a loop.  Leveraging
newest-first FlowFile prioritization on the connections and the run schedule on
the UpdateAttribute processors, one *could* engineer conditions where older
FlowFiles don't get an attribute "refreshed" in time and are subsequently
routed out (or aged out) of the loop.  A few new challenges arise in this
thought experiment, namely applying proper backpressure to the loop as a whole,
and failures due to heavily loaded instances taking more time than expected.
I'll think about this a bit more, run some tests, and get back to you.

Thanks,
Lee


