Posted to users@nifi.apache.org by Conrad Crampton <co...@SecData.com> on 2016/08/03 12:37:10 UTC

ExecuteSQL question

Hi,
My use case is that I want to periodically ship a load of rows from an RDBMS and put them into HDFS as Avro.
The QueryDatabaseTable processor has functionality that would be great, i.e. max column value tracking (there are a couple of columns in the data I could use for this), and it is this functionality I am looking for, BUT the data is not from one single table. The nature of the RDBMS is that the business view on the data requires a bunch of joins across other tables/schemas to produce the correct Avro file, so the options I appear to have are:

a) Use QueryDatabaseTable for each table that makes up the business view and do the joins etc. in HDFS (Spark or something) – or potentially do the reconciliation within NiFi?

b) Use ExecuteSQL to run the complete SQL to get the rows, which can easily be put into HDFS as Avro given that each row will be the business (denormalised) data that is required.

The problem with a) is the reconciliation (denormalisation) of the data, and the problem with b) is how to maintain the max column value so I only get the data added since the last run.

In order to address b), can I use the DistributedMapCacheServer & Client to hold a key/value pair of the last run date, and extract from this date as a parameter?

Thanks for any suggestions.

Conrad


SecureData, combating cyber threats
______________________________________________________________________ 
The information contained in this message or any of its attachments may be privileged and confidential and intended for the exclusive use of the intended recipient. If you are not the intended recipient any disclosure, reproduction, distribution or other dissemination or use of this communications is strictly prohibited. The views expressed in this email are those of the individual and not necessarily of SecureData Europe Ltd. Any prices quoted are only valid if followed up by a formal written quote.

SecureData Europe Limited. Registered in England & Wales 04365896. Registered Address: SecureData House, Hermitage Court, Hermitage Lane, Maidstone, Kent, ME16 9NT

Re: ExecuteSQL question

Posted by Conrad Crampton <co...@SecData.com>.
Thanks for the input on this.
As a follow-up, this is what I have done…

Created a text file which contains a single value – the lowest ID of the data I am retrieving – saved on each node of the cluster.
Then
GetFile (to read the file)
ExtractText (to get the value and set it as an attribute)
UpdateAttribute (to create another attribute which is the upper bound of the ID to use in the SQL, i.e. lower <= ID < upper)
ExecuteSQL (using the attributes above as parameters – see the query sketch below) – using a DBCPConnectionPool service for the connection
Branch1
    MergeContent
    PutHDFS
Branch2
    ReplaceText (replace the flowfile content with the upper attribute from above)
    PutFile (same filename as it came in with)
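
Since the SQL select query property of ExecuteSQL supports the Expression Language, the two attributes can be referenced directly in the statement. A minimal sketch – the attribute names (id.lower, id.upper) and the table name are only examples, the real statement is the full denormalised join:

    SELECT *
    FROM business_data          -- in reality the big multi-table join
    WHERE id >= ${id.lower}
      AND id <  ${id.upper}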

Rinse…Repeat

Appears to be working OK. The issue is that, as I have it now, if ExecuteSQL fails the original file has already been deleted – it would probably be better to move the file rather than delete it, so there is always a backup of the last ID used.

Conrad




Re: ExecuteSQL question

Posted by Yohann Lepage <yo...@lepage.info>.
Hi,

I have exactly the same use case: periodically getting rows from some security appliances with just read-only access.

Currently (without NiFi), we use an SQL query to track the maximum value; depending on the DB/appliance/vendor, it could be a simple "SELECT getdate()" or "select max(SW_TIME) from log_table", or a more complex query with an INNER JOIN.
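
For illustration, a more complex tracking query might look something like this – the devices table and its columns are made-up names, only log_table and SW_TIME appear above:

    SELECT MAX(l.SW_TIME)
    FROM log_table l
    INNER JOIN devices d ON d.device_id = l.device_id   -- devices/device_id are hypothetical
    WHERE d.monitored = 1;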

So, it would be great to have an option to customize the tracking query.

Regards



-- 
Yohann Lepage

Re: ExecuteSQL question

Posted by Matt Burgess <ma...@gmail.com>.
For that approach I would think either the MapCache or the file would work. The trick will be getting the max value out of the flow file. After QueryDatabaseTable you could split the Avro and convert to JSON (or vice versa), then update the MapCache or file. I'm not sure the order of records is guaranteed; I recall an issue with the FIFO prioritizer when all the files have the same created date (which was fixed in 1.0, not sure about 0.7.x). An alternative is to use a scripting processor to guarantee that you only overwrite the value if it is greater than what's in the cache/file. I've got an example in Groovy for reading the MapCache:
http://funnifi.blogspot.com/2016/04/inspecting-your-nifi.html

I'm not near my computer at the moment – does PutDistributedMapCache have an option to store the value only if it is greater than the existing value? If not, that might be a helpful improvement.

Regards,
Matt



Re: ExecuteSQL question

Posted by Conrad Crampton <co...@SecData.com>.
Hi, 
Thanks for this.
I did think about an MV but unfortunately I don't have access to create views – just read access. That would have been my simplest option ;-) Life's never that easy though, is it?
The only part of the SQL I need to be dynamic is the date parameter (I could even use the ID column). Instead of using the MapCache (if that isn't a good idea), could I use GetFile to pull a single txt file containing the parameter (i.e. the last run date or the max ID value from the last run), which creates a flowfile; read that value; and pass it into ExecuteSQL as the parameter in the SQL, since the select query can be dynamically constructed from attributes/flowfile content (as per the docs)? And then finally write the text file back to the file system to be picked up next time?
Thanks
Conrad



Re: ExecuteSQL question

Posted by Matt Burgess <ma...@gmail.com>.
Conrad,

Is it possible to add a view (materialized or not) to the RDBMS? That
view could take care of the denormalization, and then
QueryDatabaseTable could point at the view. The DB would take care of
the push-down filters, which functionally is as if you had a
QueryDatabaseTable for each table and then did the joins.
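
A rough sketch of such a view – all table and column names here are invented, the actual definition would be whatever joins produce the business view:

    CREATE VIEW business_view AS
    SELECT o.id,
           o.updated_at,
           c.customer_name,
           p.product_code
    FROM sales.orders o
    JOIN crm.customers c ON c.customer_id = o.customer_id
    JOIN catalogue.products p ON p.product_id = o.product_id;

QueryDatabaseTable could then use business_view as its Table Name and track, say, updated_at (or id) as the maximum-value column.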

In NiFi 1.0 there is a GenerateTableFetch processor which is like
QueryDatabaseTable except it generates SQL instead of executing SQL.
That might be used in your a) option above but you'd have to reconcile
the SQL statements into a JOIN. A possible improvement to either/both
processors would be to add attributes for the maximum value columns
whose values are the maximum observed values. Then you wouldn't have
to parse or manipulate the SQL if you really just want the max values.

I have been thinking about how QueryDatabaseTable and
GenerateTableFetch would work if they accepted incoming flow files (to
allow dynamic table names for example). It's a bit more tricky because
those processors run without input to get max values, so their
behavior would change when a flow file is present but would return to
the original behavior if no flow file is present. Since the
ListDatabaseTables processor is also in 1.0, it would be nice to use
that as input to the other two processors.

I'm definitely interested in any thoughts or discussion around these things :)

Regards,
Matt
