Posted to users@nifi.apache.org by "Kuhfahl, Bob" <rk...@mitre.org> on 2018/08/09 14:51:56 UTC

Question about NiFi and bulk database inserts - is PutSQL the only out of the box option?

I’m trying to get bulk inserts going using the PutSQL processor, but it’s starting to get ugly, so I need to reach out and see if any of you have been down this path.

If you have, here’s some info.  If not, thanks for reading this far ☺

Background:
Legacy database migration ETL task.  Extract from one database, do a bunch of transformations, then load it all into a PostgreSQL repo.
We have hundreds of tables, with obviously many record structures _and a ton of data_.

According to:
https://community.hortonworks.com/articles/91849/design-nifi-flow-for-using-putsql-processor-to-per.html

PutSQL, to do batch inserts, seems to want the form of the SQL statement to be identical for each record type.
e.g. Insert into Employee ("name", "job title") VALUES (?,?)

Easy enough to build that but then it needs attributes for all the values and types in the flow.
e.g.
1.  sql.args.1.value = Bryan B
2.  sql.args.2.value = Director
Use Update Attribute Processor to set sql.args.N.type Flow file attributes
1.  sql.args.1.type = 12 (VARCHAR)
2.  sql.args.2.type = 12

THIS implies my flow will need to create a couple attributes for every single field in the dataflow – AND I’ll have to come up with logic to determine what the data type is…
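
If I’m reading the article right, what PutSQL builds from those attributes is
roughly the plain JDBC below (just my mental model, not the actual processor
code; the class name, connection URL, and credentials are placeholders):

    import java.sql.*;

    public class PutSqlSketch {
        public static void main(String[] args) throws SQLException {
            // sql.args.N.value supplies the parameter value; sql.args.N.type is
            // the java.sql.Types code (12 = Types.VARCHAR).
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://host:5432/targetdb", "user", "pass");
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO Employee (\"name\", \"job title\") VALUES (?, ?)")) {
                ps.setObject(1, "Bryan B", Types.VARCHAR);   // sql.args.1.*
                ps.setObject(2, "Director", Types.VARCHAR);  // sql.args.2.*
                ps.executeUpdate();
            }
        }
    }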

I’m a newbie at this NiFi stuff, but that really does _not_ feel like I’m going down a good path.
I’m hand-jamming a proof of concept just to validate the above, but I’m having a hard time lining up the data types (e.g. the database has a char(2) field; I’ve tried char, then varchar, …).

The other SQL “insert-able” processors seem to want to read a file instead of a flow, but I could easily be missing something.
Suggestions would be appreciated!


Re: Question about NiFi and bulk database inserts - is PutSQL the only out of the box option?

Posted by Boris Tyukin <bo...@boristyukin.com>.
Bob,

Yes, the performance difference is dramatic if you use bulk load
tools/techniques. A 100x speedup is not an exaggeration; I saw it myself
with Oracle, SQL Server and Sybase IQ. All vendors also have documented
practices for doing this efficiently. Some suggest dropping indexes and FKs
first and recreating them after the bulk load, using text files in specific
formats, etc.
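
For Postgres specifically (your target), the kind of thing I mean is the
COPY path - e.g. via the JDBC driver's CopyManager. A rough, untested
sketch (the class name, connection URL, table and file name are made up):

    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class PgCopySketch {
        public static void main(String[] args) throws Exception {
            // Stream a CSV file through Postgres COPY instead of row-by-row INSERTs.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://host:5432/targetdb", "user", "pass")) {
                CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
                long rows = copy.copyIn(
                    "COPY employee (name, job_title) FROM STDIN WITH (FORMAT csv)",
                    new FileReader("/tmp/employee.csv"));
                System.out.println("loaded " + rows + " rows");
            }
        }
    }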

Hi Matt,

I think it would be very difficult to create something generic, as you said.
All the big ETL vendors have built custom integrations with specific RDBMSs
to support vendor-specific APIs and bulk insert tools, and you will never
beat the performance of those proprietary interfaces.

I learned to love the ELT approach, using staging tables and files, and it
works really well with Hive/Impala/Sqoop/HDFS. We are on CDH, but we use
similar techniques with external tables and moving files around on HDFS.
And NiFi helps us coordinate all of these things, even though it is not
quite a real-time use case :)

Now I have another project in the pipeline to consume data from Kafka and
ingest it into Kudu - this one will benefit from your suggestion to use the
record-oriented processors and combine inserts into batches. I have not
looked at it yet, and my target is not exactly an RDBMS :)

I was afraid of using NiFi for batch-like jobs, but I am loving it so far,
and we have had a few batch-like jobs running for some time now without
problems. I blogged about how to run Sqoop from NiFi using Groovy:
https://boristyukin.com/how-to-run-sqoop-from-nifi/

This was a Sqoop import job, but an export would be similar.
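
An export would look roughly like this (the connection string, credentials,
table and HDFS directory below are all placeholders for your environment):

    sqoop export \
      --connect jdbc:postgresql://host:5432/targetdb \
      --username etl_user \
      --password-file /user/etl/.pg_pass \
      --table employee \
      --export-dir /staging/employee \
      --input-fields-terminated-by ','

Same idea as in the import post: wrap it with ExecuteProcess or a scripted
processor and let NiFi handle the scheduling and monitoring around it.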


Re: Question about NiFi and bulk database inserts - is PutSQL the only out of the box option?

Posted by Matt Burgess <ma...@apache.org>.
Boris,

That is correct, PutDatabaseRecord uses PreparedStatements and statement
batches according to the JDBC spec, rather than proprietary or
vendor-specific solutions. The latter are very difficult to support and
maintain; some databases support bulk insert SQL commands, some have
external programs (each with their own syntax and behavior), etc.
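
In plain JDBC terms, the pattern it follows is basically this (a simplified
sketch, not the actual processor code; the class name, table and columns are
placeholders):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class JdbcBatchSketch {
        // One PreparedStatement, many parameter sets, sent as a single batch.
        static void insertBatch(Connection conn, List<String[]> records) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO employee (name, job_title) VALUES (?, ?)")) {
                for (String[] rec : records) {
                    ps.setString(1, rec[0]);
                    ps.setString(2, rec[1]);
                    ps.addBatch();     // accumulate parameter sets client-side
                }
                ps.executeBatch();     // execute them as one batch
            }
        }
    }

How much that helps depends on the driver, but it avoids generating and
parsing a separate INSERT statement per row.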

For the case where there is a better bulk insert option using a SQL
dialect, we could consider a SQLRecordSetWriter. This could be leveraged by
ConvertRecord to basically replace ConvertJSONToSQL, and it could ask the
database adapter for dialect-specific SQL for bulk insert. Then if any
additional changes need to be made to the SQL statements, you still have a
chance to change them before sending to PutSQL.  However, if the database
does not support a bulk insert option via SQL, then the SQLWriter could
degrade to worse performance than the older ConvertJSONToSQL -> PutSQL
option, since you wouldn't be able to leverage the attributes for values
and types to be used in a PreparedStatement; they would just be a bunch of
full INSERT lines in a flow file. I think that tradeoff makes the
record-based SQL solution less attractive (and indeed, is what has kept me
from working on a SQLWriter thus far).

Per your comment about dumping the data somewhere like HDFS, I agree there
are definitely situations in which you would rather put a lot of data in
one place and run a bulk ingest tool, rather than having the data flow
through NiFi. Often this happens when the interface to the target system
does not support efficient ingest (such as INSERT statements in SQL).
Before the Hive 3 bundle (the Hive Streaming API is supposed to be much
faster/better than before), I would often put the data in HDFS and then
create a table atop it using PutHiveQL. In fact, that's why I added a
partial CREATE TABLE IF NOT EXISTS statement to the ConvertAvroToORC
processor, so you could dump the ORC file to HDFS, then use ReplaceText to
fill the content with the DDL statement, then send that to PutHiveQL.
Other folks even leave the conversion to Hive: they set up a raw external
table over the dumped data, then do an INSERT ... SELECT in Hive to pull
the data from the raw external table and convert it to a different format
for the managed target table.
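
As a rough illustration of that last pattern, the statements you end up
sending (via PutHiveQL, or plain Hive JDBC as sketched here) look something
like the following; the class name, table names, columns, location and
formats are just made up for the example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveEltSketch {
        public static void main(String[] args) throws Exception {
            // Raw external table over the dumped files, then INSERT ... SELECT
            // into the managed ORC table (assumed to already exist).
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default");
                 Statement stmt = conn.createStatement()) {
                stmt.execute(
                    "CREATE EXTERNAL TABLE IF NOT EXISTS employee_raw "
                    + "(name STRING, job_title STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                    + "LOCATION '/staging/employee'");
                stmt.execute(
                    "INSERT INTO TABLE employee SELECT name, job_title FROM employee_raw");
            }
        }
    }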

I'm always interested in performance improvements we can make, especially
in the RDBMS world, so I'm all ears for ideas :)

Regards,
Matt






Re: Question about NiFi and bulk database inserts - is PutSQL the only out of the box option?

Posted by Boris Tyukin <bo...@boristyukin.com>.
Matt, but it is still not using bulk load methods, right? Some databases
have proprietary ways of doing that fast, rather than running a bunch of
insert statements.

Bob, what I've done in the past is dump the data either to HDFS or local
disk and then use bulk load tools specific to your target platform to do
the job efficiently. Sqoop can also do it to some degree, but it has a long
list of limitations when it comes to exports.



Re: Question about NiFi and bulk database inserts - is PutSQL the only out of the box option?

Posted by Matt Burgess <ma...@apache.org>.
Bob,

Unless you already have SQL in your flow files, I always recommend
PutDatabaseRecord [1] over PutSQL. The former is basically a mashup of
ConvertJSONToSQL -> PutSQL, but takes in data in any format supported by
our record readers (CSV, Avro, XML, JSON, etc.) and takes care of all the
SQL generation (and prepared statement stuff) under the hood. You should
find it a lot easier to work with, and a lot faster than the older
alternative, especially since PutDatabaseRecord is able to deal with an
entire set of rows/records in one flow file, rather than having to split up
large CSV files, e.g. into individual rows to get individual SQL
statements. If you try it out, please let us know if you run into any
issues; I will do my best to help get you up and going.
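
For your case the configuration is pretty minimal, something like the
following (property names from memory, so double-check them against the
PutDatabaseRecord docs for your version):

    Record Reader                       -> CSVReader / JsonTreeReader / AvroReader service
    Database Connection Pooling Service -> DBCPConnectionPool pointing at your PostgreSQL repo
    Statement Type                      -> INSERT
    Schema Name / Table Name            -> the target table (these support Expression
                                           Language, so one flow can feed many tables)

The record reader's schema drives the column names and value conversion, so
you don't have to set any sql.args.N.* attributes at all.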

Regards,
Matt

[1]
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.7.1/org.apache.nifi.processors.standard.PutDatabaseRecord/index.html

